Zakres analizy¶

Obowiązkowe elementy:

  1. Oczyt danych
    • Poprawne załadowanie danych ze źródła internetowego do ramki danych, z uwzględnieniem nagłówków, kodowania zbioru, separatorów itd.;
  2. Wymiary, oszacowanie czasochłonności rozmiaru analizy
    • Poznanie rozmiaru zbioru danych (liczby obserwacji i liczby zmiennych, które je opisują)
    • Oszacowanie czasochłonności procesu analizy;
  3. Rozeznanie struktury, próbka danych
    • Wyświetlenie próbki surowych danych w celu wyrobienia sobie wyobrażenia o nich – poznania struktury danych i wstępnej oceny przydatności poszczególnych zmiennych;
  4. Weryfikacja typów danych
    • Weryfikacja typów poszczególnych zmiennych (całkowite, zmiennoprzecinkowe, kategoryczne porządkowe, kategoryczne nominalne, zmienne typu logicznego, daty) i ich ewentualna korekta (zamiana typu string na float, interpretacja zmiennych numerycznych jako kategorii itp.);
  5. Anailiza danych:

    • Zbudowanie podsumowania zmiennych numerycznych opisujących zbiór, w postaci jednej tabelki, zawierającej podstawowe informacje, takie jak:
      • wartości minimalne,
      • wartości maksymalne,
      • średnia,
      • mediana,
      • drugi (dolny) kwartyl,
      • trzeci (górny) kwartyl,
      • odchylenie standardowe,
      • liczba danych brakujących lub nienumerycznych.
    • W tym kroku należy również dokonać analogicznej analizy zmiennych - kategorycznych, dającej dla każdej z nich informacje m.in. takie jak:
      • liczby poszczególnych kategorii i ich liczności,
      • wartości najczęściej występującej i częstości jej występowania,
      • liczba wartości unikalnych,
      • liczba braków danych.
  1. Sprawdzenie ilości NaN, sprawozdanie

    • Sprawdzenie, czy w zbiorze występują braki danych. Należy sporządzić odrębne podsumowanie, skupiając się na poszukiwaniu brakujących wartości w zbiorze – Pozwoli to Państwu odpowiedzieć na pytanie, jakie zmienne zawierają braki i jaka jest ich liczba, z czego mogą one wynikać itd.

    • Etap ten (wraz z poprzednim) pozwoli Państwu odnaleźć błędy w danych – brakujące wartości, błędne interpretacje rodzaju zmiennych itp. Da również wskazówki, które atrybuty wybrać do analizy (pod kątem ich istotności dla przewidywań modelu), czy i jak uzupełnić brakujące dane (ewentualnie usunąć wiersze/kolumny, zawierające zbyt wiele braków danych), dokonać ich transformacji itd.

  2. Wizualizacja
    • Wizualizacja rozkładu (wybranych) zmiennych (zarówno numerycznych, jak i kategorycznych) poprzez histogramy i próba ich scharakteryzowania (np. poprzez ich skośność i kurtozę)
    • będzie to pomocne np. w procesie imputacji (uzupełniania) zmiennych numerycznych;
  3. Czyszczenie danych
    • Przeprowadzenie czyszczenia danych, obejmujące m.in.:
      • uzupełnienie brakujących danych (np. wartością stałą, średnią/medianą/modą dla całego zbioru lub dla podzbiorów według kategorii, poprzez interpolację itp.), usunięcie wierszy/kolumn, zawierających zbyt wiele braków danych,
      • przycięcie odstających wartości (ang. outliers) – pomocne będą m.in. takie techniki, jak wykres punktowy (gdzie nanosimy na obu osiach ten sam atrybut) lub wykres pudełkowy i ewentualna normalizacja danych numerycznych (metodą min-max lub Z-score) – niektóre algorytmy modelowania danych są wrażliwe na punkty odstające (np. metody regresji liniowej, korelacja Pearsona) czy różnice w zakresie zmienności poszczególnych atrybutów (niektóre algorytmy klasyfikacji bądź grupowania);
  4. Zbadanie zależności, korelacje
    • Zbadanie zależności pomiędzy zmiennymi – krok ten pozwoli odkryć związki pomiędzy poszczególnymi zmiennymi; informacje te mogą także zostać użyte, np. na etapie transformacji zmiennych lub do podjęcia decyzji, które zmienne wybrać do budowy modelu:
      • obliczenie macierzy korelacji (można użyć współczynnika korelacji rang Spearmana lub współczynnika Pearsona) pomiędzy zmiennymi numerycznymi i zwizualizowanie ich za pomocą wykresów punktowych (ang. scatter plots) lub tzw. wykresów par zmiennych (ang. pairplots),
      • ewentualne zbadanie zależności pomiędzy zmiennymi kategorycznymi (współczynnik V Cramméra) i zależności pomiędzy zmiennymi kategorycznymi i numerycznymi (współczynnik R modelu liniowego z jedną zmienną kategoryczną, która objaśnia zmienną numeryczną) oraz (podobnie jak powyżej) zwizualizowanie tych zależności w formie wykresów;
In [ ]:
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
import textwrap
import geopandas as gpd
import json

import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objs as go

%matplotlib inline
In [ ]:
pip install -U kaleido # do wyświetlania wykresów w formacie .svg zamiast interaktywnych
Requirement already satisfied: kaleido in /usr/local/lib/python3.10/dist-packages (0.2.1)
In [ ]:
import plotly.io as pio

GeoJSON i mapa¶

In [ ]:
gdf = gpd.read_file("/content/drive/MyDrive/Analiza danych python/Ćwiczenia/AirBnB/Barcelona/neighbourhoods.geojson")
In [ ]:
gdf = gdf.reset_index()
print(gdf.shape)
gdf.head(1)
(75, 4)
Out[ ]:
index neighbourhood neighbourhood_group geometry
0 0 el Raval Ciutat Vella MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3...
In [ ]:
geojson = json.loads(gdf.to_json())
In [ ]:
fig = px.choropleth_mapbox(gdf, geojson=geojson,
                           locations=gdf.index,
                           color='neighbourhood_group',
                           hover_name='neighbourhood',
                           hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False},
                           featureidkey='properties.index',
                           mapbox_style='carto-positron',
                           center={'lat': 41.3933, 'lon': 2.1734},
                           zoom=11,
                           opacity=0.5)

fig.update_layout(
    height=800, template='plotly_dark', bargap=0.05,
    title='Mapa dzielnic Barcelony')

fig.show()

1. Wczytanie danych¶

2. Analiza wymiarów¶

In [ ]:
df = pd.read_csv('/content/drive/MyDrive/Analiza danych python/Ćwiczenia/AirBnB/Barcelona/listings.csv.gz')
print(f'Wymiary: {df.shape}\n')
df.head(1)
Wymiary: (18086, 75)

Out[ ]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... review_scores_communication review_scores_location review_scores_value license instant_bookable calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 18674 https://www.airbnb.com/rooms/18674 20230906022853 2023-09-06 city scrape Rental unit in Barcelona · ★4.30 · 3 bedrooms ... 110m2 apartment to rent in Barcelona. Located ... Apartment in Barcelona located in the heart of... https://a0.muscache.com/pictures/13031453/413c... 71615 ... 4.65 4.78 4.27 HUTB-002062 t 30 30 0 0 0.3

1 rows × 75 columns

Widok na wszystkie kolumny oraz widok na pierwsze wiersze, ilość NaN i typy danych¶

Dokumentacja

Wnioski z tego kroku:

  • 75 kolumn, 18 tysięcy rekordów
  • [x] 'price' z object na float
  • [x] booleany wymagają poprawy typu z object na bool
  • [x] daty wymagają poprawy typu z object na datetime
  • [x] 2 kolumny z wartościami % (host_response_rate i host_acceptance_rate) wymagają poprawy z object na float
  • [x] kolumna bathrooms jest zbędna, informacje zawarte w formie tekstu w bathroom_text
  • [ ] kolumna calendar_updated jest zupełnie pusta, można usunąć
  • [ ] kolumna neighbourhood można usunąć, neighbourhood_cleansed ma lepsze (na podstawie współrzędnych)
  • [x] kolumna amenities mogłaby być przekształcona na kilka kolumn one-hot
  • [x] kolumna host_response_time może być zakodowana w formie ordinal encoding
id listing_url scrape_id last_scraped source
name description neighborhood_overview picture_url host_id
host_url host_name host_since host_location host_about
host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url
host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications
host_has_profile_pic host_identity_verified neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed
latitude longitude property_type room_type accommodates
bathrooms bathrooms_text bedrooms beds amenities
price minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights
minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated
has_availability availability_30 availability_60 availability_90 availability_365
calendar_last_scraped number_of_reviews number_of_reviews_ltm number_of_reviews_l30d first_review
last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin
review_scores_communication review_scores_location review_scores_value license instant_bookable
calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
print(pd.DataFrame(np.array(df.columns.tolist()).reshape(15,5), columns=['', '', '', '', '']).to_markdown(index=False))

Weryfikacja danych¶

  1. Cena
  2. Bools
  3. Łazienki
  4. DateTime
  5. Udogodnienia
  6. Ordinal response time
In [ ]:
length = df.shape[0]
print(f'Total records: {length}')

for i in np.array(df.columns.tolist()).reshape(15,5):
  hd = df[i].head(3)

  nans =  [df[j].isna().sum() for j in hd.columns]
  # nans_perc = [(str(int(round(df[j].isna().sum() / length * 100, 4))) + '%') for j in hd.columns]

  nans_perc = [ round(df[j].isna().sum() / length, 3) * 100 for j in hd.columns ]

  # nans_perc = [round(df[j].isna().sum() / length, 3) * 100 for j in hd.columns]
  uniques = [df[j].nunique() for j in hd.columns]
  typ = [ df[df[j].notnull()][j].dtype for j in hd.columns]

  # print(typ)

  hd.loc[len(hd)] = typ
  hd.loc[len(hd)] = nans
  hd.loc[len(hd)] = nans_perc
  hd.loc[len(hd)] = uniques

  hd.loc[len(hd)-3] = hd.loc[len(hd)-3].astype('int')
  hd.loc[len(hd)-1] = hd.loc[len(hd)-1].astype('int')

  hd = hd.rename(index={int(len(hd)-4): 'Data type',
                        int(len(hd)-3): 'Total NaN',
                        int(len(hd)-2): '% NaN',
                        int(len(hd)-1): 'Unique values',
                        })

  display(hd)
  print(' ')
Total records: 18086
id listing_url scrape_id last_scraped source
0 18674 https://www.airbnb.com/rooms/18674 20230906022853 2023-09-06 city scrape
1 23197 https://www.airbnb.com/rooms/23197 20230906022853 2023-09-06 city scrape
2 117010 https://www.airbnb.com/rooms/117010 20230906022853 2023-09-06 previous scrape
Data type int64 object int64 object object
Total NaN 0 0 0 0 0
% NaN 0.0 0.0 0.0 0.0 0.0
Unique values 18086 18086 1 1 2
 
name description neighborhood_overview picture_url host_id
0 Rental unit in Barcelona · ★4.30 · 3 bedrooms ... 110m2 apartment to rent in Barcelona. Located ... Apartment in Barcelona located in the heart of... https://a0.muscache.com/pictures/13031453/413c... 71615
1 Rental unit in Sant Adria de Besos · ★4.77 · 3... Beautiful spacious apartment, large terrace, 5... Strategically located in the Parc del Fòrum, a... https://a0.muscache.com/pictures/miso/Hosting-... 90417
2 Rental unit in Barcelona · ★4.55 · 3 bedrooms ... Have an authentic Barcelona experience! Modern... Apartment is located just 240 meters from Sagr... https://a0.muscache.com/pictures/4052d8b5-7352... 567180
Data type object object object object int64
Total NaN 0 118 7978 0 0
% NaN 0.0 0.7 44.1 0.0 0.0
Unique values 6237 16445 6967 17751 6969
 
host_url host_name host_since host_location host_about
0 https://www.airbnb.com/users/show/71615 Mireia And Maria 2010-01-19 Barcelona, Spain We are Mireia (47) & Maria (49), two multiling...
1 https://www.airbnb.com/users/show/90417 Etain (Marnie) 2010-03-09 Catalonia, Spain Hi there, I'm marnie from Australia, though I ...
2 https://www.airbnb.com/users/show/567180 Sandra 2011-05-08 Barcelona, Spain Suite Home Sagrada Familia is a complex of ap...
Data type object object object object object
Total NaN 0 2 2 4058 6698
% NaN 0.0 0.0 0.0 22.4 37.0
Unique values 6969 3363 3284 403 3560
 
host_response_time host_response_rate host_acceptance_rate host_is_superhost host_thumbnail_url
0 within an hour 97% 90% f https://a0.muscache.com/im/users/71615/profile...
1 within an hour 100% 94% t https://a0.muscache.com/im/pictures/user/44b56...
2 within a few hours 88% 98% f https://a0.muscache.com/im/pictures/user/7aaeb...
Data type object object object object object
Total NaN 2963 2963 2685 419 2
% NaN 16.4 16.4 14.8 2.3 0.0
Unique values 4 69 97 2 6804
 
host_picture_url host_neighbourhood host_listings_count host_total_listings_count host_verifications
0 https://a0.muscache.com/im/users/71615/profile... la Sagrada Família 47.0 48.0 ['email', 'phone']
1 https://a0.muscache.com/im/pictures/user/44b56... El Besòs i el Maresme 6.0 9.0 ['email', 'phone']
2 https://a0.muscache.com/im/pictures/user/7aaeb... la Sagrada Familia 19.0 19.0 ['email', 'phone']
Data type object object float64 float64 object
Total NaN 2 8545 2 2 0
% NaN 0.0 47.2 0.0 0.0 0.0
Unique values 6804 152 94 128 8
 
host_has_profile_pic host_identity_verified neighbourhood neighbourhood_cleansed neighbourhood_group_cleansed
0 t t Barcelona, CT, Spain la Sagrada Família Eixample
1 t t Sant Adria de Besos, Barcelona, Spain el Besòs i el Maresme Sant Martí
2 t f Barcelona, Catalonia, Spain la Sagrada Família Eixample
Data type object object object object object
Total NaN 2 2 7978 0 0
% NaN 0.0 0.0 44.1 0.0 0.0
Unique values 2 2 63 70 10
 
latitude longitude property_type room_type accommodates
0 41.40556 2.17262 Entire rental unit Entire home/apt 8
1 41.412432 2.21975 Entire rental unit Entire home/apt 5
2 41.40647 2.17457 Entire rental unit Entire home/apt 8
Data type float64 float64 object object int64
Total NaN 0 0 0 0 0
% NaN 0.0 0.0 0.0 0.0 0.0
Unique values 8878 9741 61 4 16
 
bathrooms bathrooms_text bedrooms beds amenities
0 NaN 2 baths 3.0 6.0 ["Refrigerator", "30\" TV", "Iron", "Hangers",...
1 NaN 2 baths 3.0 4.0 ["Refrigerator", "Toaster", "Cleaning products...
2 NaN 2 baths 3.0 6.0 ["Refrigerator", "Clothing storage: closet and...
Data type float64 object float64 float64 object
Total NaN 18086 9 6360 314 0
% NaN 100.0 0.0 35.2 1.7 0.0
Unique values 0 34 12 21 15556
 
price minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights
0 $202.00 1 1125 1 4
1 $255.00 3 300 3 5
2 $331.00 2 30 2 3
Data type object int64 int64 int64 int64
Total NaN 0 0 0 0 0
% NaN 0.0 0.0 0.0 0.0 0.0
Unique values 698 86 204 88 98
 
minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm calendar_updated
0 1125 1125 2.2 1125.0 NaN
1 1125 1125 3.2 1125.0 NaN
2 30 32 2.0 30.3 NaN
Data type int64 int64 float64 float64 float64
Total NaN 0 0 0 0 18086
% NaN 0.0 0.0 0.0 0.0 100.0
Unique values 192 189 412 817 0
 
has_availability availability_30 availability_60 availability_90 availability_365
0 t 4 11 21 34
1 t 16 31 61 150
2 f 0 0 0 0
Data type object int64 int64 int64 int64
Total NaN 0 0 0 0 0
% NaN 0.0 0.0 0.0 0.0 0.0
Unique values 2 31 61 91 366
 
calendar_last_scraped number_of_reviews number_of_reviews_ltm number_of_reviews_l30d first_review
0 2023-09-06 38 8 0 2013-05-27
1 2023-09-06 73 11 1 2011-03-15
2 2023-09-06 48 6 1 2011-08-09
Data type object int64 int64 int64 object
Total NaN 0 0 0 0 4466
% NaN 0.0 0.0 0.0 0.0 24.7
Unique values 1 532 179 32 3355
 
last_review review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin
0 2023-06-26 4.3 4.41 4.62 4.76
1 2023-08-15 4.77 4.93 4.89 4.94
2 2023-08-13 4.55 4.59 4.57 4.82
Data type object float64 float64 float64 float64
Total NaN 4466 4466 4549 4548 4553
% NaN 24.7 24.7 25.2 25.1 25.2
Unique values 1658 180 174 186 172
 
review_scores_communication review_scores_location review_scores_value license instant_bookable
0 4.65 4.78 4.27 HUTB-002062 t
1 4.99 4.6 4.64 HUTB005057 f
2 4.91 4.86 4.59 HUTB000331 f
Data type float64 float64 float64 object object
Total NaN 4547 4552 4553 5764 0
% NaN 25.1 25.2 25.2 31.9 0.0
Unique values 158 154 192 5635 2
 
calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms calculated_host_listings_count_shared_rooms reviews_per_month
0 30 30 0 0 0.3
1 2 2 0 0 0.48
2 19 19 0 0 0.33
Data type int64 int64 int64 int64 float64
Total NaN 0 0 0 0 4466
% NaN 0.0 0.0 0.0 0.0 24.7
Unique values 66 62 30 10 782
 
Poprawa price¶
  • brak NaN
  • wszystkie są w $
In [ ]:
def fix_price(price):
  price = price.split('$')[1]
  if ',' in price:
    price = price.replace(',', '')
  return float(price)

fix_price('$1,000')
Out[ ]:
1000.0
In [ ]:
df['price'] = df['price'].apply(lambda x: fix_price(x))
Poprawa booleanów i %¶
In [ ]:
# bool_fix = {
#     'f' : False,
#     't': True
# }
# df = df.applymap(lambda x: bool_fix.get(x, x))

bool_fix = {
    'f' : 0,
    't': 1
}
df = df.applymap(lambda x: bool_fix.get(x, x))
In [ ]:
def fix_rates(rate):
  if isinstance(rate, str):
    rate = rate.split('%')[0]
    return int(rate) / 100
  else:
    return rate
In [ ]:
df['host_response_rate'] = df['host_response_rate'].apply(lambda x: fix_rates(x))
df['host_acceptance_rate'] = df['host_acceptance_rate'].apply(lambda x: fix_rates(x))
Poprawa łazienek¶
In [ ]:
df['bathrooms_text'].unique()
Out[ ]:
array(['2 baths', '1.5 baths', '2.5 baths', '3 baths', '1 shared bath',
       '1 bath', '1 private bath', '3.5 baths', '4 baths',
       '1.5 shared baths', nan, '2 shared baths', '2.5 shared baths',
       '5.5 baths', '7.5 baths', '4.5 baths', '6 baths', '0 shared baths',
       'Half-bath', 'Private half-bath', '0 baths', '5 baths', '8 baths',
       '3 shared baths', '8 shared baths', '4 shared baths',
       'Shared half-bath', '5 shared baths', '3.5 shared baths',
       '6 shared baths', '5.5 shared baths', '10 baths',
       '10 shared baths', '6.5 baths', '4.5 shared baths'], dtype=object)

Pomysł na rozwiązanie:

  • 3 kolumny, jedna shared_bathrooms druga private_bathrooms i zostawić ogólną kolumnę nie rozróżniającą między prywatną i dzieloną łazienką
  • najpierw sprawdzić shared/private
  • sprawdzić cyfry w tekście, jeśli nie ma, będzie 1 łazienka
  • jeden wyjątek - nan, który jest typu float - będzie 0 łazienek
  • powinno być dosyć proste: liczba łazienek jest zawsze podana na początku tekstu
In [ ]:
def fix_private_bathrooms(bathtext):
  if isinstance(bathtext, float): # wyjątek dla `nan`
      return 0
  if isinstance(bathtext, str):
    if 'shared' in bathtext.lower():
      return 0
    else:
      n = bathtext.split(' ')[0]
      try: # czy jest w ogóle cyfra czy np. tylko 'Half-bath'
        return float(n)
      except ValueError:
        return 1
  else:
    return bathtext # zwróci None do uzupełnienia później


def fix_shared_bathrooms(bathtext):
  if isinstance(bathtext, float):
      return 0
  if isinstance(bathtext, str):
    if 'shared' not in bathtext.lower():
      return 0
    else:
      n = bathtext.split(' ')[0]
      try:
        return float(n)
      except ValueError:
        return 1
  else:
    return bathtext

def total_baths(bathtext): # nie ma przypadków, żeby były wspólne i prywatne jednocześnie, po prostu nie rozróżnia ilości
  if isinstance(bathtext, float):
      return 0
  if isinstance(bathtext, str):
      n = bathtext.split(' ')[0]
      try:
        return float(n)
      except ValueError:
        return 1
  else:
    return bathtext
In [ ]:
# TEST fix_private_bathrooms, fix_shared_bathrooms, total_baths
for i in df['bathrooms_text'].unique():
  print(i, type(i), fix_shared_bathrooms(i))
2 baths <class 'str'> 0
1.5 baths <class 'str'> 0
2.5 baths <class 'str'> 0
3 baths <class 'str'> 0
1 shared bath <class 'str'> 1.0
1 bath <class 'str'> 0
1 private bath <class 'str'> 0
3.5 baths <class 'str'> 0
4 baths <class 'str'> 0
1.5 shared baths <class 'str'> 1.5
nan <class 'float'> 0
2 shared baths <class 'str'> 2.0
2.5 shared baths <class 'str'> 2.5
5.5 baths <class 'str'> 0
7.5 baths <class 'str'> 0
4.5 baths <class 'str'> 0
6 baths <class 'str'> 0
0 shared baths <class 'str'> 0.0
Half-bath <class 'str'> 0
Private half-bath <class 'str'> 0
0 baths <class 'str'> 0
5 baths <class 'str'> 0
8 baths <class 'str'> 0
3 shared baths <class 'str'> 3.0
8 shared baths <class 'str'> 8.0
4 shared baths <class 'str'> 4.0
Shared half-bath <class 'str'> 1
5 shared baths <class 'str'> 5.0
3.5 shared baths <class 'str'> 3.5
6 shared baths <class 'str'> 6.0
5.5 shared baths <class 'str'> 5.5
10 baths <class 'str'> 0
10 shared baths <class 'str'> 10.0
6.5 baths <class 'str'> 0
4.5 shared baths <class 'str'> 4.5
In [ ]:
df['private_bathrooms'] = df['bathrooms_text'].apply(lambda x: fix_private_bathrooms(x))
df['shared_bathrooms'] = df['bathrooms_text'].apply(lambda x: fix_shared_bathrooms(x))
df['total_bathrooms'] = df['bathrooms_text'].apply(lambda x: total_baths(x))
In [ ]:
df[['private_bathrooms', 'shared_bathrooms', 'total_bathrooms']].head()
Out[ ]:
private_bathrooms shared_bathrooms total_bathrooms
0 2.0 0.0 2.0
1 2.0 0.0 2.0
2 2.0 0.0 2.0
3 1.5 0.0 1.5
4 2.5 0.0 2.5
Poprawa DateTime¶
In [ ]:
from datetime import datetime
In [ ]:
datetime_colnames = ['last_scraped', 'last_scraped', 'host_since', 'calendar_last_scraped', 'first_review', 'last_review']
In [ ]:
df[datetime_colnames].head(3)
Out[ ]:
last_scraped last_scraped host_since calendar_last_scraped first_review last_review
0 2023-09-06 2023-09-06 2010-01-19 2023-09-06 2013-05-27 2023-06-26
1 2023-09-06 2023-09-06 2010-03-09 2023-09-06 2011-03-15 2023-08-15
2 2023-09-06 2023-09-06 2011-05-08 2023-09-06 2011-08-09 2023-08-13
In [ ]:
test = df[datetime_colnames].head(3).loc[0][0]
print(test, type(test))
fixed = datetime.strptime(test, '%Y-%m-%d').date()
print(fixed, type(fixed))
2023-09-06 <class 'str'>
2023-09-06 <class 'datetime.date'>
In [ ]:
# df[datetime_colnames].applymap(lambda x: datetime.strptime(x, '%Y-%m-%d').date() if isinstance(x, str) else x)
In [ ]:
for i in datetime_colnames:
  df[i] = df[i].apply(lambda x: datetime.strptime(x, '%Y-%m-%d').date() if isinstance(x, str) else x)
In [ ]:
print(df['last_scraped'][0])
type(df['last_scraped'][0])
2023-09-06
Out[ ]:
datetime.date
Amenities¶

Udogodnienia warte uwzględnienia:

  • refrigerator
  • tv
  • iron?
  • crib?
  • parking
  • elevator
  • ac / air conditioning / heating
  • wifi
  • kitchen / microwave / stove / oven / dishwasher / toaster
  • washer / washing machine
In [ ]:
df['amenities'].notnull().sum()
Out[ ]:
18086
In [ ]:
# # widok na kilka przykładów
# for i in range(10):
#  display(df['amenities'][i].lower())
In [ ]:
amenities = {'refrigerator': ['refrigerator'],
             'tv' : ['tv', 'television'],
             'iron': ['iron'],
             'crib': ['crib'],
             'parking': ['parking'],
             'elevator': ['elevator'],
             'AC/heating' : ['ac', 'air', 'conditioning', 'heat',],
             'wifi': ['wifi', 'internet'],
             'kitchen' : ['kitchen', 'microwave', 'stove', 'oven', 'dishwasher', 'toaster',],
             'washer' : ['wash']}
In [ ]:
for key in amenities:
  print(f'Key: {key}, Value: {amenities[key]}')
Key: refrigerator, Value: ['refrigerator']
Key: tv, Value: ['tv', 'television']
Key: iron, Value: ['iron']
Key: crib, Value: ['crib']
Key: parking, Value: ['parking']
Key: elevator, Value: ['elevator']
Key: AC/heating, Value: ['ac', 'air', 'conditioning', 'heat']
Key: wifi, Value: ['wifi', 'internet']
Key: kitchen, Value: ['kitchen', 'microwave', 'stove', 'oven', 'dishwasher', 'toaster']
Key: washer, Value: ['wash']
In [ ]:
def check_amenities(text, key):
  text = text.lower()

  return any(i in text for i in amenities[key])
In [ ]:
for key in amenities:
  df[key] = df['amenities'].apply(lambda x: check_amenities(x, key))
In [ ]:
df['washer'].value_counts()
Out[ ]:
True     15077
False     3009
Name: washer, dtype: int64
In [ ]:
df[[i for i in amenities]].tail(5)
Out[ ]:
refrigerator tv iron crib parking elevator AC/heating wifi kitchen washer
18081 True True False False False True True True True True
18082 True True True True True False True True True True
18083 True True True True False False True True True True
18084 False True False False True False True True True True
18085 False True False False True False True True True True
Ordinal encoding host_response_time¶
In [ ]:
df['host_response_time'].value_counts()
Out[ ]:
within an hour        10879
within a few hours     2393
within a day           1558
a few days or more      293
Name: host_response_time, dtype: int64
In [ ]:
response_ordinal = {
    'within an hour' : 1,
    'within a few hours' : 2,
    'within a day' : 3,
    'a few days or more' : 4
}

df['host_response_time'] = df['host_response_time'].map(response_ordinal)
In [ ]:
df['host_response_time'].unique()
Out[ ]:
array([ 1.,  2.,  3., nan,  4.])
Usunięcie zbędnych kolum¶
In [ ]:
df.drop(columns=['bathrooms','bathrooms_text', 'calendar_updated'], inplace=True)
df.shape
Out[ ]:
(18086, 85)

Posumowanie weryfikacji:¶

  1. Poprawiono typy danych z object na float, DateTime, bool
  2. Utworzono trzy nowe kolumny dotyczące łazienek - prywatne, dzielone i ogólnie obie
  3. Dodano następujące kolumny: refrigerator tv iron crib parking elevator AC/heating wifi kitchen washer
  4. Usunięto zupełnie pustą kolumnę calendar_updated

Analiza danych¶

Widok na numeryczne kolumny¶

Uwagi:

  • minimum_maximum_nights, maximum_maximum_nights i maximum_nights_avg_ntm mają bardzo wysokie wartości
  • price również ma miejscami bardzo wysokie wartości
  • Barcelona ma szereg regulacji dot. krótkoterminowych wynajmów (short-term rentals), między innymi zakaz wynajmu na mniej niż miesiąc bez odpowiedniej licencji. Warta sprawdzenia będzie relacja między rodzajem licencji a możliwą ilością noclegów. Przykładowo exempt w kolumnie license oznacza, że gospodarz nie potrzebuje licencji i może wynająć lokal na 1-11 miesięcy.
In [ ]:
cols = '''host_response_rate host_acceptance_rate host_listings_count host_total_listings_count accommodates total_bathrooms private_bathrooms shared_bathrooms beds price minimum_nights 	maximum_nights 	minimum_minimum_nights 	maximum_minimum_nights minimum_maximum_nights 	maximum_maximum_nights 	minimum_nights_avg_ntm 	maximum_nights_avg_ntm availability_30 	availability_60 	availability_90 	availability_365 number_of_reviews 	number_of_reviews_ltm 	number_of_reviews_l30d review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication review_scores_location review_scores_value calculated_host_listings_count 	calculated_host_listings_count_entire_homes 	calculated_host_listings_count_private_rooms 	calculated_host_listings_count_shared_rooms reviews_per_month
'''
cols = cols.replace('\n', '')
cols = cols.replace(' 	', ' ')
cols = cols.split(' ')
for i in cols:
  if ' ' in i:
    cols.pop(i)
print(cols)
print(len(cols))
['host_response_rate', 'host_acceptance_rate', 'host_listings_count', 'host_total_listings_count', 'accommodates', 'total_bathrooms', 'private_bathrooms', 'shared_bathrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'calculated_host_listings_count', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'reviews_per_month']
37
In [ ]:
step = 0
for i in range(7):
  display(df[cols[step: step+5]].describe().round(2)[1::]) # pomija `count`, już znane
  print('')
  step += 5
display(7, df[cols[35:]].describe().round(2)[1::])
host_response_rate host_acceptance_rate host_listings_count host_total_listings_count accommodates
mean 0.94 0.87 41.13 53.55 3.36
std 0.14 0.23 87.47 109.17 2.14
min 0.00 0.00 1.00 1.00 1.00
25% 0.96 0.86 2.00 2.00 2.00
50% 1.00 0.99 6.00 9.00 3.00
75% 1.00 1.00 34.00 48.00 4.00
max 1.00 1.00 786.00 1853.00 16.00

total_bathrooms private_bathrooms shared_bathrooms beds price
mean 1.36 1.05 0.31 2.36 162.80
std 0.69 0.83 0.67 1.83 1057.27
min 0.00 0.00 0.00 1.00 8.00
25% 1.00 1.00 0.00 1.00 52.00
50% 1.00 1.00 0.00 2.00 100.00
75% 2.00 1.50 0.00 3.00 185.00
max 10.00 10.00 10.00 30.00 90000.00

minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights
mean 14.76 574.35 14.54 18.89 2.381127e+05
std 32.36 448.48 33.69 46.54 2.258195e+07
min 1.00 1.00 1.00 1.00 1.000000e+00
25% 1.00 180.25 1.00 3.00 3.000000e+02
50% 3.00 365.00 2.00 5.00 3.650000e+02
75% 31.00 1125.00 31.00 31.00 1.125000e+03
max 1125.00 3000.00 1125.00 2705.00 2.147484e+09

maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm availability_30 availability_60
mean 2.381551e+05 17.34 2.381370e+05 7.72 19.37
std 2.258195e+07 40.63 2.258195e+07 9.48 19.80
min 1.000000e+00 1.00 1.000000e+00 0.00 0.00
25% 3.330000e+02 2.00 3.300000e+02 0.00 0.00
50% 9.990000e+02 3.70 4.071000e+02 4.00 13.00
75% 1.125000e+03 31.00 1.125000e+03 12.00 34.00
max 2.147484e+09 1125.00 2.147484e+09 30.00 60.00

availability_90 availability_365 number_of_reviews number_of_reviews_ltm number_of_reviews_l30d
mean 36.43 171.82 42.22 11.44 0.92
std 30.56 129.47 85.73 23.27 1.99
min 0.00 0.00 0.00 0.00 0.00
25% 1.00 47.00 1.00 0.00 0.00
50% 35.00 175.00 6.00 2.00 0.00
75% 62.00 302.00 42.00 15.00 1.00
max 90.00 365.00 1817.00 836.00 91.00

review_scores_rating review_scores_accuracy review_scores_cleanliness review_scores_checkin review_scores_communication
mean 4.54 4.62 4.59 4.71 4.71
std 0.62 0.51 0.52 0.46 0.47
min 0.00 0.00 0.00 0.00 0.00
25% 4.41 4.50 4.45 4.64 4.64
50% 4.68 4.75 4.72 4.85 4.85
75% 4.90 4.93 4.92 5.00 5.00
max 5.00 5.00 5.00 5.00 5.00

review_scores_location review_scores_value calculated_host_listings_count calculated_host_listings_count_entire_homes calculated_host_listings_count_private_rooms
mean 4.74 4.44 31.63 23.87 7.60
std 0.37 0.56 60.75 54.20 32.38
min 0.00 0.00 1.00 0.00 0.00
25% 4.67 4.28 1.00 0.00 0.00
50% 4.83 4.56 5.00 2.00 0.00
75% 5.00 4.78 28.00 19.00 2.00
max 5.00 5.00 294.00 294.00 233.00

7
calculated_host_listings_count_shared_rooms reviews_per_month
mean 0.08 1.44
std 0.74 1.85
min 0.00 0.01
25% 0.00 0.24
50% 0.00 0.85
75% 0.00 2.10
max 12.00 55.02
Kolumny minimum_maximum_nights, maximum_maximum_nights¶
  • ta kolumna ma nadzwyczaj wysokie wartości, a konkretnie max. możliwą wartość dla int-32
  • jest to jedna z 6 kolumn, które zawierają wartości obliczone, nie zescrapowane
  • kolumny te są liczone na podstawie kalendarza na przyszły rok
  • większość rekordów ma 'przewidziane' ilości maksymalnej ilości noclegów na rok lub 3 lata, więc osobiście traktowałbym je jako bardzo nierzetelne wartości i usunąłbym je
In [ ]:

Kolumny dot. liczby noclegów przedstawiają się następująco:

Field Type Calculated Description
minimum_nights integer minimum number of night stay for the listing (calendar rules may be different)
maximum_nights integer maximum number of night stay for the listing (calendar rules may be different)
minimum_minimum_nights integer y the smallest minimum_night value from the calender (looking 365 nights in the future)
maximum_minimum_nights integer y the largest minimum_night value from the calender (looking 365 nights in the future)
minimum_maximum_nights integer y the smallest maximum_night value from the calender (looking 365 nights in the future)
maximum_maximum_nights integer y the largest maximum_night value from the calender (looking 365 nights in the future)
minimum_nights_avg_ntm numeric y the average minimum_night value from the calender (looking 365 nights in the future)
maximum_nights_avg_ntm numeric y the average maximum_night value from the calender (looking 365 nights in the future)
In [ ]:
df[df['minimum_maximum_nights'] >= 3000]['minimum_maximum_nights']
Out[ ]:
48            3000
744           3000
2463    2147483647
2540    2147483647
9162          9999
Name: minimum_maximum_nights, dtype: int64
In [ ]:
nights = '''minimum_nights
maximum_nights
minimum_minimum_nights
maximum_minimum_nights
minimum_maximum_nights
maximum_maximum_nights
minimum_nights_avg_ntm
maximum_nights_avg_ntm'''

df.iloc[48][nights.split('\n')]
Out[ ]:
minimum_nights                28
maximum_nights              3000
minimum_minimum_nights        28
maximum_minimum_nights        28
minimum_maximum_nights      3000
maximum_maximum_nights      3000
minimum_nights_avg_ntm      28.0
maximum_nights_avg_ntm    3000.0
Name: 48, dtype: object
In [ ]:
print(df.query('minimum_nights != minimum_minimum_nights').shape)
print(df.query('minimum_nights != maximum_minimum_nights').shape)
print(df.query('maximum_nights != minimum_maximum_nights').shape)
print(df.query('minimum_nights != maximum_maximum_nights').shape)
(3545, 85)
(5126, 85)
(4153, 85)
(17991, 85)
In [ ]:
df.query('minimum_nights != minimum_minimum_nights')[nights.split('\n')].head()
Out[ ]:
minimum_nights maximum_nights minimum_minimum_nights maximum_minimum_nights minimum_maximum_nights maximum_maximum_nights minimum_nights_avg_ntm maximum_nights_avg_ntm
3 21 31 1 1 31 31 1.0 31.0
6 3 1125 2 6 1125 1125 3.3 1125.0
7 5 365 2 5 1125 1125 4.9 1125.0
14 2 364 1 4 364 364 2.0 364.0
20 1 1125 3 3 1125 1125 3.0 1125.0
In [ ]:
display(df[df['minimum_maximum_nights'] < 3000]['minimum_maximum_nights'].describe().round(2))
print(f"Moda: {df[df['minimum_maximum_nights'] < 3000]['minimum_maximum_nights'].mode()}")
count    18081.00
mean       637.31
std        464.52
min          1.00
25%        300.00
50%        365.00
75%       1125.00
max       1125.00
Name: minimum_maximum_nights, dtype: float64
Moda: 0    1125
Name: minimum_maximum_nights, dtype: int64
In [ ]:
df.loc[2463, 'minimum_maximum_nights'] = 365
df.loc[2540, 'minimum_maximum_nights'] = 365
In [ ]:
fig = px.histogram(df[df['minimum_maximum_nights'] < 3000], x='minimum_maximum_nights', barmode='group')

fig.update_layout(
    width=500, height=500, template='plotly_dark', bargap=0.05,
    title='Minimum maximum nights poniżej 3000',
    xaxis_title="Minimum maximum nights",
    yaxis_title="Ilość")
In [ ]:
fig = px.histogram(df, x='maximum_nights', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
In [ ]:
fig = px.histogram(df, x='minimum_maximum_nights', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
Cena¶

i znacznie odstające wartości

In [ ]:
df['price'].describe()[1:].round(2)
Out[ ]:
mean      162.80
std      1057.27
min         8.00
25%        52.00
50%       100.00
75%       185.00
max     90000.00
Name: price, dtype: float64
In [ ]:
fig = px.histogram(df, x='price', barmode='group')
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)
In [ ]:
prices = df[['listing_url','price']].sort_values(by='price')
prices
Out[ ]:
listing_url price
15499 https://www.airbnb.com/rooms/871607319497888166 8.0
2940 https://www.airbnb.com/rooms/13873341 9.0
3440 https://www.airbnb.com/rooms/17100681 9.0
212 https://www.airbnb.com/rooms/587203 9.0
4657 https://www.airbnb.com/rooms/22218299 9.0
... ... ...
8770 https://www.airbnb.com/rooms/43479422 9200.0
7439 https://www.airbnb.com/rooms/37486088 11000.0
5374 https://www.airbnb.com/rooms/25927238 57731.0
9598 https://www.airbnb.com/rooms/47725335 84999.0
3092 https://www.airbnb.com/rooms/14484042 90000.0

18086 rows × 2 columns

In [ ]:
prices.sample()
Out[ ]:
listing_url price
12133 https://www.airbnb.com/rooms/639224424789847854 174.0
In [ ]:
prices.query('price < 50').shape
Out[ ]:
(3984, 2)
In [ ]:
prices.tail(30)
Out[ ]:
listing_url price
6220 https://www.airbnb.com/rooms/31032688 2000.0
6051 https://www.airbnb.com/rooms/29808581 2100.0
7389 https://www.airbnb.com/rooms/36820054 2230.0
5984 https://www.airbnb.com/rooms/29251355 2500.0
11704 https://www.airbnb.com/rooms/588106770878222950 2500.0
5944 https://www.airbnb.com/rooms/28874308 3000.0
5353 https://www.airbnb.com/rooms/25758097 3000.0
9563 https://www.airbnb.com/rooms/46731061 3528.0
8897 https://www.airbnb.com/rooms/44000385 4000.0
8643 https://www.airbnb.com/rooms/42893238 4124.0
16651 https://www.airbnb.com/rooms/917928804879169822 4500.0
6366 https://www.airbnb.com/rooms/32088715 5214.0
9102 https://www.airbnb.com/rooms/45100047 5722.0
2952 https://www.airbnb.com/rooms/14089552 6979.0
8639 https://www.airbnb.com/rooms/42890622 7000.0
8772 https://www.airbnb.com/rooms/43483528 8478.0
1878 https://www.airbnb.com/rooms/7105437 8999.0
8782 https://www.airbnb.com/rooms/43483589 9100.0
8779 https://www.airbnb.com/rooms/43483560 9100.0
8778 https://www.airbnb.com/rooms/43483536 9100.0
6775 https://www.airbnb.com/rooms/33699122 9199.0
8764 https://www.airbnb.com/rooms/43479394 9200.0
8766 https://www.airbnb.com/rooms/43479417 9200.0
8768 https://www.airbnb.com/rooms/43479419 9200.0
8771 https://www.airbnb.com/rooms/43479425 9200.0
8770 https://www.airbnb.com/rooms/43479422 9200.0
7439 https://www.airbnb.com/rooms/37486088 11000.0
5374 https://www.airbnb.com/rooms/25927238 57731.0
9598 https://www.airbnb.com/rooms/47725335 84999.0
3092 https://www.airbnb.com/rooms/14484042 90000.0
In [ ]:
prices.query('price == 22').head(20)
Out[ ]:
listing_url price
11737 https://www.airbnb.com/rooms/589115666385828682 22.0
3746 https://www.airbnb.com/rooms/18420309 22.0
3557 https://www.airbnb.com/rooms/17468995 22.0
6554 https://www.airbnb.com/rooms/32736627 22.0
11780 https://www.airbnb.com/rooms/607907530110718284 22.0
6221 https://www.airbnb.com/rooms/31203362 22.0
6583 https://www.airbnb.com/rooms/33211494 22.0
16787 https://www.airbnb.com/rooms/922364898996026177 22.0
7164 https://www.airbnb.com/rooms/36374867 22.0
16093 https://www.airbnb.com/rooms/895158128888394391 22.0
3863 https://www.airbnb.com/rooms/18697011 22.0
17866 https://www.airbnb.com/rooms/964279024581469247 22.0
3082 https://www.airbnb.com/rooms/15013369 22.0
3084 https://www.airbnb.com/rooms/14299410 22.0
6586 https://www.airbnb.com/rooms/32832065 22.0
4408 https://www.airbnb.com/rooms/21366324 22.0
9897 https://www.airbnb.com/rooms/49056712 22.0
9863 https://www.airbnb.com/rooms/48866420 22.0
16221 https://www.airbnb.com/rooms/902291658003414702 22.0
7213 https://www.airbnb.com/rooms/36128368 22.0

Sprawdziłem niektóre apartamenty i ceny są absurdalne (również biorąc pod uwagę co jest oferowane). Niektóre oferty o bardzo wysokiej cenie, np.:

listing_url price
4159 https://www.airbnb.com/rooms/20472746 1533.0

zdają się (relatywnie 😅) uzasadnione. Górną granicę ucięcia cen ustaliłbym na $2000. Jest to ostatnia sensownie wyglądająca oferta, po niej znajdują się już niemal tylko oferty błędne i te co już wygasły i nie da się ich sprawdzić.


Z kolei oferty o bardzo niskiej cenie są najczęściej niedostępne (cena nie jest możliwa do sprawdzenia) lub są źle podane / przestarzałe. Np.:

listing_url price
212 https://www.airbnb.com/rooms/587203 9.0

w rzeczywistości jest podane jako 2778 zł za miesiąc / 30 / 5 = ok. $19 Sprawdzając oferty niższego przedziału, muszę stwierdzić że bezpieczniejszym progiem byłoby $20. Tak więc po ucięciu tabeli do ofert od 20 do 2000, mamy 17783 rekordów, czyli usunięto niecałe 2% rekordów.

Absurd apartament.png

In [ ]:
print(f"Total : {prices.query('price >= 20 & price <= 2000').shape[0]}")
print(f"Percentage left: {round(prices.query('price >= 20 & price <= 2000').shape[0] / df.shape[0], 3)}")
Total : 17783
Percentage left: 0.983
In [ ]:
prices.query('price >= 20 & price <= 2000').describe()[1:].round(2)
Out[ ]:
price
mean 142.83
std 146.41
min 20.00
25% 55.00
50% 100.00
75% 185.00
max 2000.00
In [ ]:
fig = px.histogram(prices.query('price >= 20 & price <= 2000'), x='price',
                   color_discrete_sequence=[px.colors.qualitative.Plotly[2]])
fig.update_layout(width=500, height=500, template='plotly_dark', bargap=0.05,)

fig.update_layout(
    width=500, height=500, template='plotly_dark', bargap=0.05,
    title='Histogram cen noclegów po filtrowaniu',
    xaxis_title="Cena za nocleg w $",
    yaxis_title="Ilość")
Ograniczenie ofert do cen od \$20 do \$2000¶
In [ ]:
print(df.shape)
df = df.query('price >= 20 & price <= 2000')
df.shape
(18086, 85)
Out[ ]:
(17783, 85)

Widok na kategoryczne kolumny¶

In [ ]:
# df.drop(cols, axis=1).columns
In [ ]:
categ = ['source', 'host_name', 'host_location',
'host_is_superhost', 'host_neighbourhood', 'host_verifications',
'host_has_profile_pic', 'host_identity_verified', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed',  'property_type', 'room_type',
'has_availability', 'license', 'instant_bookable',
'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator', 'AC/heating', 'wifi', 'kitchen', 'washer']
len(categ)
Out[ ]:
25
In [ ]:
step = 0
for i in categ:
  d = df[i].describe()[:2]
  d['% NaN'] = round(df[i].isna().sum() / length * 100, 2)

  u = df[i].value_counts()[:7]

  x = pd.concat([d, u]).reset_index().rename(columns={'index':'values'})
  # x.at[0, i] = x[0] / length
  display(x)
  print('')
values source
0 count 17783
1 unique 2
2 % NaN 0.0
3 city scrape 14622
4 previous scrape 3161

values host_name
0 count 17781
1 unique 3308
2 % NaN 0.01
3 Ukio 294
4 Enter 234
5 SweetInn 221
6 AB Apartment 184
7 Acomodis Apartments 158
8 Selling 149
9 Stay U-Nique 149

values host_location
0 count 13801
1 unique 389
2 % NaN 22.02
3 Barcelona, Spain 12403
4 Spain 168
5 Madrid, Spain 91
6 Catalonia, Spain 77
7 London, United Kingdom 53
8 Paris, France 39
9 Sant Just Desvern, Spain 30

values host_is_superhost
0 count 17365.000000
1 mean 0.176044
2 % NaN 2.310000
3 0.0 14308.000000
4 1.0 3057.000000

values host_neighbourhood
0 count 9381
1 unique 151
2 % NaN 46.46
3 Dreta de l'Eixample 1754
4 Vila de Gràcia 700
5 El Raval 639
6 La Sagrada Família 598
7 Sant Antoni 536
8 El Poble-sec 533
9 el Fort Pienc 414

values host_verifications
0 count 17783
1 unique 8
2 % NaN 0.0
3 ['email', 'phone'] 13598
4 ['email', 'phone', 'work_email'] 2352
5 ['phone'] 1227
6 ['phone', 'work_email'] 564
7 ['email'] 33
8 [] 6
9 None 2

values host_has_profile_pic
0 count 17781.000000
1 mean 0.981778
2 % NaN 0.010000

values host_identity_verified
0 count 17781.000000
1 mean 0.921827
2 % NaN 0.010000

values neighbourhood_cleansed
0 count 17783
1 unique 70
2 % NaN 0.0
3 la Dreta de l'Eixample 2206
4 el Raval 1443
5 el Barri Gòtic 1221
6 Sant Pere, Santa Caterina i la Ribera 1120
7 la Vila de Gràcia 1043
8 l'Antiga Esquerra de l'Eixample 1019
9 la Sagrada Família 1004

values neighbourhood_group_cleansed
0 count 17783
1 unique 10
2 % NaN 0.0
3 Eixample 6393
4 Ciutat Vella 4188
5 Sants-Montjuïc 1840
6 Sant Martí 1619
7 Gràcia 1556
8 Sarrià-Sant Gervasi 880
9 Horta-Guinardó 519

values property_type
0 count 17783
1 unique 61
2 % NaN 0.0
3 Entire rental unit 9180
4 Private room in rental unit 5361
5 Entire serviced apartment 495
6 Room in hotel 395
7 Entire condo 372
8 Entire loft 293
9 Private room in condo 219

values room_type
0 count 17783
1 unique 4
2 % NaN 0.0
3 Entire home/apt 10608
4 Private room 6904
5 Shared room 142
6 Hotel room 129

values has_availability
0 count 17783.000000
1 mean 0.937581
2 % NaN 0.000000
3 1 16673.000000
4 0 1110.000000

values license
0 count 12186
1 unique 5624
2 % NaN 30.95
3 Exempt 4682
4 HUTB-000000 112
5 HUTB-123456 25
6 AJ000593 22
7 HB004232 19
8 HUTB-022013 18
9 AJ000517 15

values instant_bookable
0 count 17783.000000
1 mean 0.381882
2 % NaN 0.000000
3 0 10992.000000
4 1 6791.000000

values refrigerator
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 11851
4 False 5932

values tv
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 14062
4 False 3721

values iron
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 12588
4 False 5195

values crib
0 count 17783
1 unique 2
2 % NaN 0.0
3 False 12553
4 True 5230

values parking
0 count 17783
1 unique 2
2 % NaN 0.0
3 False 11053
4 True 6730

values elevator
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 9631
4 False 8152

values AC/heating
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 16867
4 False 916

values wifi
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 17400
4 False 383

values kitchen
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 16384
4 False 1399

values washer
0 count 17783
1 unique 2
2 % NaN 0.0
3 True 14839
4 False 2944

Podsumowanie danych kategorycznych¶
  • host_name wskazuje, że dość wiele ofert jest złożonych przez instytucje, nie indywidualne osoby
  • jak wcześniej wspomniano, licencja będzie dość istotną kategorią,
    • człon HUTB lub HUT, Habitatge d’Ús Turístic (B dotyczy Barcelony) oznacza licencję, która umożliwia właśnie krótkoterminowy wynajem, obciążona specjalnym podatkiem. Pozostałe litery członu oznaczają id lokalu.
    • Exempt oznacza, że licencja HUT nie była wymagana (dla ułatwienia będę odnosił się do tego i tak jako licencji)
  • one-hot-encoding dla amenities okazało się sensownym zabiegiem, wiele z tych kategorii ma ciekawe proporcje, które mogą okazać się istotne w kontekście korelacji
Licencje¶
  • nie udało mi się zweryfikować czy istnieją inne licencje niż HUTB
  • po poprawieniu licencji rozkład licencji rozkłada się następująco:

    • HUTB %: 40.2
    • Exempt %: 26.45
    • NaN %: 31.87
    • Pozostałe to 1,5%

NaN nie powinno być trudne do uzupełnienia, o ile rzeczywiście licencje poprawnie odwzorowują ilość noclegów (HUTB zezwala na mniej niż 31 dni)

In [ ]:
def check_license(lcode):

    if isinstance(lcode, str):
      code = ""
      for c in lcode:
          if c.isalpha():
              code += c
          else:
              break
      return code
    else:
      return lcode

check_license('HUTB-123456')
Out[ ]:
'HUTB'
In [ ]:
for i in range(10):
  print(
      df.query('license.notnull() and license != "Exempt"')['license'].sample().apply(check_license)
      )
9138    HUTB
Name: license, dtype: object
517    HUTB
Name: license, dtype: object
14941    HUTB
Name: license, dtype: object
4479    HUTB
Name: license, dtype: object
11786    HB
Name: license, dtype: object
9616    HUTB
Name: license, dtype: object
3809    HUTB
Name: license, dtype: object
10299    HUTB
Name: license, dtype: object
1349    HUTB
Name: license, dtype: object
10735    HUTB
Name: license, dtype: object
In [ ]:
df['license_cleansed'] = df['license'].apply(check_license)
df['license_cleansed']
Out[ ]:
0        HUTB
1        HUTB
2        HUTB
3        HUTB
4        HUTB
         ... 
18081     NaN
18082    HUTB
18083     NaN
18084    HUTB
18085    HUTB
Name: license_cleansed, Length: 17783, dtype: object
In [ ]:
df['license_cleansed'].value_counts()
Out[ ]:
HUTB         6397
Exempt       4682
HB            725
AJ            167
hutb           37
ATB            31
Hutb           26
HUTS           22
               15
hb             14
B              13
No              8
HT              7
HUTTE           6
HUTG            4
HV              3
ASB             3
Pendiente       2
T               2
Media           1
HUTb            1
LLB             1
HUTL            1
Monthly         1
HTUB            1
HUBT            1
Numero          1
MINIMUM         1
EP              1
asd             1
El              1
Temporada       1
MID             1
Eri             1
Tourist         1
Periodo         1
Ref             1
Carrera         1
Período         1
LY              1
NIF             1
Name: license_cleansed, dtype: int64

Jak widać jest wiele różnych licencji, najlepiej je uśrednić

In [ ]:
def fix_license(lcode):
  if isinstance(lcode, str):
    if len(lcode) != 0 and lcode.upper()[0] == 'H' :
      return 'HUTB'
    else:
      return lcode
  else:
    return lcode

fix_license('hutb')
Out[ ]:
'HUTB'
In [ ]:
df['license_cleansed'] = df['license_cleansed'].apply(lambda x: fix_license(x))
df['license_cleansed'].value_counts()
Out[ ]:
HUTB         7245
Exempt       4682
AJ            167
ATB            31
               15
B              13
No              8
ASB             3
Pendiente       2
T               2
Numero          1
Eri             1
LLB             1
MINIMUM         1
EP              1
asd             1
Temporada       1
MID             1
Periodo         1
Tourist         1
Monthly         1
Carrera         1
NIF             1
El              1
Media           1
Período         1
Ref             1
LY              1
Name: license_cleansed, dtype: int64
In [ ]:
print(f"HUTB %: {round(df['license_cleansed'].value_counts()['HUTB'] / df.shape[0] * 100, 2)}")
print(f"Exempt %: {round(df['license_cleansed'].value_counts()['Exempt'] / df.shape[0]* 100, 2)}")
print(f"NaN %: {round(df['license_cleansed'].isna().sum()  / df.shape[0]* 100, 2)}")
print('Pozostałe to 1,5%')
print('NaN nie powinno być trudne do uzupełnienia, o ile rzeczywiście licencje poprawnie odwzorowują ilość noclegów (HUTB umożliwia wynajem poniżej 31 dni)')
HUTB %: 40.74
Exempt %: 26.33
NaN %: 31.47
Pozostałe to 1,5%
NaN nie powinno być trudne do uzupełnienia, o ile rzeczywiście licencje poprawnie odwzorowują ilość noclegów (HUTB umożliwia wynajem poniżej 31 dni)
In [ ]:
grouped_nights = df.groupby('license_cleansed').describe()[nights.split('\n')].sort_values(('minimum_nights','count'), ascending=False)
grouped_nights
Out[ ]:
minimum_nights maximum_nights ... minimum_nights_avg_ntm maximum_nights_avg_ntm
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
license_cleansed
HUTB 7245.0 3.190200 13.376993 1.0 1.0 2.0 3.0 865.0 7245.0 622.537336 ... 3.70 865.0 7245.0 6.972791e+02 4.658369e+02 2.0 355.60 999.0 1125.0 9.999000e+03
Exempt 4682.0 16.792610 34.970667 1.0 1.0 4.0 31.0 1000.0 4682.0 504.924178 ... 31.00 1000.0 4682.0 6.429399e+02 4.490364e+02 1.0 340.00 365.0 1125.0 1.125000e+03
AJ 167.0 1.059880 0.237978 1.0 1.0 1.0 1.0 2.0 167.0 767.077844 ... 1.20 4.9 167.0 6.441222e+02 4.328183e+02 6.5 215.80 955.3 999.0 1.125000e+03
ATB 31.0 1.032258 0.179605 1.0 1.0 1.0 1.0 2.0 31.0 642.258065 ... 1.50 2.7 31.0 1.385480e+08 5.362931e+08 12.0 365.00 999.0 1125.0 2.147484e+09
15.0 21.600000 23.900986 1.0 2.5 30.0 31.5 90.0 15.0 536.333333 ... 31.50 90.0 15.0 6.700400e+02 4.497104e+02 60.0 315.00 400.0 1125.0 1.125000e+03
B 13.0 3.307692 8.320503 1.0 1.0 1.0 1.0 31.0 13.0 1066.461538 ... 2.00 31.0 13.0 1.125000e+03 0.000000e+00 1125.0 1125.00 1125.0 1125.0 1.125000e+03
No 8.0 31.375000 0.517549 31.0 31.0 31.0 32.0 32.0 8.0 888.750000 ... 31.25 32.0 8.0 8.887500e+02 4.374500e+02 180.0 888.75 1125.0 1125.0 1.125000e+03
ASB 3.0 2.000000 1.000000 1.0 1.5 2.0 2.5 3.0 3.0 247.000000 ... 2.50 3.0 3.0 2.470000e+02 2.043820e+02 11.0 188.00 365.0 365.0 3.650000e+02
T 2.0 2.000000 0.000000 2.0 2.0 2.0 2.0 2.0 2.0 1125.000000 ... 2.00 2.0 2.0 1.125000e+03 0.000000e+00 1125.0 1125.00 1125.0 1125.0 1.125000e+03
Pendiente 2.0 31.000000 0.000000 31.0 31.0 31.0 31.0 31.0 2.0 140.000000 ... 31.00 31.0 2.0 1.400000e+02 5.656854e+01 100.0 120.00 140.0 160.0 1.800000e+02
NIF 1.0 31.000000 NaN 31.0 31.0 31.0 31.0 31.0 1.0 370.000000 ... 31.00 31.0 1.0 3.700000e+02 NaN 370.0 370.00 370.0 370.0 3.700000e+02
Temporada 1.0 33.000000 NaN 33.0 33.0 33.0 33.0 33.0 1.0 1125.000000 ... 33.00 33.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
Tourist 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1125.000000 ... 1.00 1.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
Período 1.0 32.000000 NaN 32.0 32.0 32.0 32.0 32.0 1.0 330.000000 ... 31.90 31.9 1.0 3.300000e+02 NaN 330.0 330.00 330.0 330.0 3.300000e+02
Periodo 1.0 32.000000 NaN 32.0 32.0 32.0 32.0 32.0 1.0 330.000000 ... 32.00 32.0 1.0 3.300000e+02 NaN 330.0 330.00 330.0 330.0 3.300000e+02
Numero 1.0 3.000000 NaN 3.0 3.0 3.0 3.0 3.0 1.0 1125.000000 ... 3.00 3.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
Ref 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1125.000000 ... 1.00 1.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
MINIMUM 1.0 31.000000 NaN 31.0 31.0 31.0 31.0 31.0 1.0 120.000000 ... 31.00 31.0 1.0 1.200000e+02 NaN 120.0 120.00 120.0 120.0 1.200000e+02
Monthly 1.0 60.000000 NaN 60.0 60.0 60.0 60.0 60.0 1.0 1125.000000 ... 32.00 32.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
Media 1.0 32.000000 NaN 32.0 32.0 32.0 32.0 32.0 1.0 330.000000 ... 32.00 32.0 1.0 3.300000e+02 NaN 330.0 330.00 330.0 330.0 3.300000e+02
MID 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1125.000000 ... 10.40 10.4 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
LY 1.0 90.000000 NaN 90.0 90.0 90.0 90.0 90.0 1.0 360.000000 ... 90.00 90.0 1.0 3.600000e+02 NaN 360.0 360.00 360.0 360.0 3.600000e+02
LLB 1.0 2.000000 NaN 2.0 2.0 2.0 2.0 2.0 1.0 360.000000 ... 2.00 2.0 1.0 3.600000e+02 NaN 360.0 360.00 360.0 360.0 3.600000e+02
Eri 1.0 2.000000 NaN 2.0 2.0 2.0 2.0 2.0 1.0 1125.000000 ... 2.00 2.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
El 1.0 32.000000 NaN 32.0 32.0 32.0 32.0 32.0 1.0 1000.000000 ... 32.00 32.0 1.0 1.000000e+03 NaN 1000.0 1000.00 1000.0 1000.0 1.000000e+03
EP 1.0 31.000000 NaN 31.0 31.0 31.0 31.0 31.0 1.0 300.000000 ... 31.00 31.0 1.0 3.000000e+02 NaN 300.0 300.00 300.0 300.0 3.000000e+02
Carrera 1.0 1.000000 NaN 1.0 1.0 1.0 1.0 1.0 1.0 1125.000000 ... 1.00 1.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03
asd 1.0 30.000000 NaN 30.0 30.0 30.0 30.0 30.0 1.0 1125.000000 ... 30.00 30.0 1.0 1.125000e+03 NaN 1125.0 1125.00 1125.0 1125.0 1.125000e+03

28 rows × 64 columns

In [ ]:
# grouped_nights.to_excel('grouped nights.xlsx')
In [ ]:
# def fix_other_licenses(min_nights, license):
#   if license != 'HUTB' or license != 'Exempt':
#     if min_nights >= 28:
#       return 'Exempt'
#     else:
#       return 'HUTB'

# fix_other_licenses(32, 'HUTB')
In [ ]:
def fix_other_licenses(min_nights, license):
  if license == 'HUTB':
    return 'HUTB'
  elif license == 'Exempt':
    return 'Exempt'
  else:
    if min_nights >= 28:
      return 'Exempt'
    else:
      return 'HUTB'

fix_other_licenses(32, 'HUTB')
Out[ ]:
'HUTB'
In [ ]:
df['license_cleansed'] = df.apply(lambda row: fix_other_licenses(row['minimum_nights'], row['license_cleansed']), axis=1)
In [ ]:
grouped_nights = df.groupby('license_cleansed').describe()[nights.split('\n')].sort_values(('minimum_nights','count'), ascending=False)
grouped_nights
Out[ ]:
minimum_nights maximum_nights ... minimum_nights_avg_ntm maximum_nights_avg_ntm
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
license_cleansed
Exempt 8903.0 26.043132 37.487728 1.0 3.0 31.0 32.0 1125.0 8903.0 515.466472 ... 32.0 1125.0 8903.0 626.931461 4.357041e+02 1.0 330.0 365.0 1125.0 3.000000e+03
HUTB 8880.0 3.015541 12.133153 1.0 1.0 2.0 3.0 865.0 8880.0 630.362275 ... 3.5 865.0 8880.0 484362.419865 3.222655e+07 1.0 330.0 999.0 1125.0 2.147484e+09

2 rows × 64 columns

Licencje zostały z grubsza oczyszczone. Nie licząc błędów istniejących już w oryginalnie podanych klasach HUTB i Exempt, myślę że tak może już to zostać. Pozostało jeszcze tylko usnięcie license i utworzenie dummies dla license_cleansed.

In [ ]:
dum_license = pd.get_dummies(df["license_cleansed"], drop_first=False)
df = pd.concat([df, dum_license], axis=1)
# df.drop('license_cleansed', axis=1, inplace=True)
df.drop('license', axis=1, inplace=True)
Rodzaj pokoju¶

Robimy one-hot-encoding dla rodzaju pokoju, powinno mieć duże znaczenie dla ceny, w szczególności shared_room / private_room

In [ ]:
dum_room_type = pd.get_dummies(df["room_type"], drop_first=False)
df = pd.concat([df, dum_room_type], axis=1)
# df.drop('ogrzewanie', axis=1, inplace=True)
In [ ]:
df.head()
Out[ ]:
id listing_url scrape_id last_scraped source name description neighborhood_overview picture_url host_id ... wifi kitchen washer license_cleansed Exempt HUTB Entire home/apt Hotel room Private room Shared room
0 18674 https://www.airbnb.com/rooms/18674 20230906022853 2023-09-06 city scrape Rental unit in Barcelona · ★4.30 · 3 bedrooms ... 110m2 apartment to rent in Barcelona. Located ... Apartment in Barcelona located in the heart of... https://a0.muscache.com/pictures/13031453/413c... 71615 ... True True True HUTB 0 1 1 0 0 0
1 23197 https://www.airbnb.com/rooms/23197 20230906022853 2023-09-06 city scrape Rental unit in Sant Adria de Besos · ★4.77 · 3... Beautiful spacious apartment, large terrace, 5... Strategically located in the Parc del Fòrum, a... https://a0.muscache.com/pictures/miso/Hosting-... 90417 ... True True True HUTB 0 1 1 0 0 0
2 117010 https://www.airbnb.com/rooms/117010 20230906022853 2023-09-06 previous scrape Rental unit in Barcelona · ★4.55 · 3 bedrooms ... Have an authentic Barcelona experience! Modern... Apartment is located just 240 meters from Sagr... https://a0.muscache.com/pictures/4052d8b5-7352... 567180 ... True True True HUTB 0 1 1 0 0 0
3 32711 https://www.airbnb.com/rooms/32711 20230906022853 2023-09-06 city scrape Rental unit in Barcelona · ★4.46 · 2 bedrooms ... A lovely two bedroom apartment only 250 m from... What's nearby <br />This apartment is located... https://a0.muscache.com/pictures/357b25e4-f414... 135703 ... True True True HUTB 0 1 1 0 0 0
4 118228 https://www.airbnb.com/rooms/118228 20230906022853 2023-09-06 previous scrape Rental unit in Barcelona · ★4.56 · 3 bedrooms ... Modern 100m2 recently decorated apartment with... Apartment is located just 240 meters from Sagr... https://a0.muscache.com/pictures/c4b34854-5db1... 567180 ... True True True HUTB 0 1 1 0 0 0

5 rows × 91 columns

Widok na kolumny dat¶

  • niektóre z nich mają tylko jedną datę, np. last_scraped
  • last_scraped i calendar_last_scraped są jednakowe, i nie mają żadnych NaN, generalnie można usunąć
In [ ]:
date_c = """last_scraped
host_since
calendar_last_scraped
first_review
last_review"""
date_c = date_c.split('\n')
date_c
Out[ ]:
['last_scraped',
 'host_since',
 'calendar_last_scraped',
 'first_review',
 'last_review']
In [ ]:
df[date_c].describe()
Out[ ]:
last_scraped host_since calendar_last_scraped first_review last_review
count 17783 17781 17783 13412 13412
unique 1 3257 1 3342 1604
top 2023-09-06 2020-05-15 2023-09-06 2023-07-09 2023-08-20
freq 17783 294 17783 36 367
In [ ]:
df[date_c].isna().sum()
Out[ ]:
last_scraped                0
host_since                  2
calendar_last_scraped       0
first_review             4371
last_review              4371
dtype: int64
In [ ]:
fig = px.histogram(df[['first_review','last_review']],
                   )

fig.update_layout(
    width=900, height=500, template='plotly_dark', bargap=0.05,
    title='Histogram recenzji',
    xaxis_title="",
    yaxis_title="Częstość",
    legend=dict(title=''))
In [ ]:
# df['host_since'].max()
old = df[df['host_since'].notnull()]['host_since'].min()
new = df[df['host_since'].notnull()]['host_since'].max()
(new.year - old.year) * 12 - (new.month - old.month)
Out[ ]:
180
In [ ]:
fig = px.histogram(df['host_since'], color_discrete_sequence=[px.colors.qualitative.Plotly[9]], nbins = 180)

fig.update_layout(
    width=900, height=500, template='plotly_dark', bargap=0.05,
    title='Histogram rozpoczęcia działalności gospodarzy',
    xaxis_title="",
    yaxis_title="Ilość gospodarzy",
    legend=dict(title=''))
In [ ]:
x.at[1, 'washer'] = 3
x
Out[ ]:
values washer
0 count 17783
1 unique 3
2 % NaN 0.0
3 True 14839
4 False 2944

Dystans¶

Sprawdziłem kilka miejsc, żeby sprawdzić Plaça de Catalunya jako centrum Barcelony, do którego można będzie policzyć dystans z każdej oferty.

Metoda płaska:

In [ ]:
def get_distance(lon, lat):
  # plaza = [41.387016, 2.170047]
  sagrada = [41.4036, 2.1744]
  return (((lon - sagrada[0])**2 + (lat - sagrada[1])**2)**0.5) * 111139

print(get_distance(41.40556, 2.17262), '(m)')
294.2561129702953 (m)

Metoda Haversina $$ d = 2r\;arcsin \left( \sqrt{sin^2 \frac{\phi_2 - \phi_1}{2} + cos \phi_1 \times cos\phi_2 \times sin^2\frac{\lambda_2 - \lambda_1}{2}} \right)$$

Wychodzi bardzo podobnie, Haversin jest przydatny przy dużo większych dystansach, a jest bardziej kosztowny obliczniowo. (Być może jest biblioteka która policzyła by to nie w Pythonie).

In [ ]:
import math

def get_haversin_distance(lon1, lat1, lon2, lat2):
    R = 6371
    dLat = math.radians(lat2 - lat1)
    dLon = math.radians(lon2 - lon1)
    lat1 = math.radians(lat1)
    lat2 = math.radians(lat2)
    a = math.sin(dLat / 2) * math.sin(dLat / 2) + \
        math.sin(dLon / 2) * math.sin(dLon / 2) * math.cos(lat1) * math.cos(lat2)
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    distance = R * c
    return distance

print(get_haversin_distance(41.40556, 2.17262, 41.4036, 2.1744), '(km)')
0.2942881317311864 (km)

Tworzymy nową kolumnę center_distance, Sprawdzenie które miejsce jest najlepsze

In [ ]:
def get_distance(lon, lat):
  plaza = [41.387016, 2.170047] # 0.1421 zostajemy przy Plaza de Cataluña
  sagrada = [41.4036, 2.1744] # 0.0868
  gotic = [41.3833, 2.1767] # 0.1405
  mila = [41.3954, 2.1618] # 0.0960
  vicens = [41.4029, 2.1526] # 0.01
  MuseuMares = [41.3844, 2.1778] # 0.1437
  plazadejoan = [41.3873, 2.1881] #0.1322
  picasso = [41.3852, 2.1809] #0.1435

  place = plaza
  return (((lon - place[0])**2 + (lat - place[1])**2)**0.5) * 111139

print(get_distance(41.40556, 2.17262), '(m)')

df['center_distance'] = get_distance(df['latitude'], df['longitude'])
df['price'].corr(df['center_distance'])
2080.7057154382396 (m)
Out[ ]:
-0.11007620393122651
In [ ]:
# def get_distance(lon, lat, place):
#   # plaza = [41.387016, 2.170047]
#   # sagrada = [41.4036, 2.1744]
#   return (((lon - place[0])**2 + (lat - place[1])**2)**0.5) * 111139

# # print(get_distance(41.40556, 2.17262), '(m)')



# places = {
# 'Basilica de la Sagrada Familia' : [41.4036, 2.1744],
# 'Barri Gòtic' : [41.3833, 2.1767],
# 'Casa Milà' : [41.3954, 2.1618],
# 'Casa Vicens' : [41.4029, 2.1526],
# 'Museu Frederic Marès': [41.3844, 2.1778],
# }

# for p in places:
#   print(p, places[p], df['price'].corr(
#       get_distance(df['latitude'], df['longitude'], places[p])
#   ))

Brakujące dane¶

  • host_neighbourhood, neighbourhood i neighborhood_overview, host_about można spokojnie usunąć, mamy neighborhood_cleansed, host nie powinien być zbyt istotny. Można by się zastanowić, czy są jakieś praktyki gospodarzy, które mogłyby wpływać na cenę, np. czy żądają kaucji za zniszczenia mebli lub czy potrafią mówić po angielsku.
  • problematyczna jest ilość NaN w bedrooms i beds; została uzupełniona na podstawie accomodates
  • w zestawie danych tylko dwa znajdują wiersze z brakiem informacji dot. gospodarza m. in. host_identity_verified, host_listings_count, host_picture_url, host_name, host_since
In [ ]:
n = df.isna().sum().sort_values(ascending=False)
n[n > 0]

fig = px.bar(n[n>2])

fig.update_layout(
    width=1300, height=700, template='plotly_dark', bargap=0.05,
    title='Wykres brakujących wartości',
    xaxis_title="Kolumna",
    yaxis_title="Ilość NaN",
    legend=dict(title=''))

fig
In [ ]:
n = df.isna().sum().sort_values(ascending=False)
n[n > 0]
Out[ ]:
host_neighbourhood             8402
neighbourhood                  7821
neighborhood_overview          7821
host_about                     6527
bedrooms                       6098
review_scores_checkin          4453
review_scores_value            4453
review_scores_location         4452
review_scores_accuracy         4449
review_scores_cleanliness      4448
review_scores_communication    4447
first_review                   4371
reviews_per_month              4371
last_review                    4371
review_scores_rating           4371
host_location                  3982
host_response_time             2812
host_response_rate             2812
host_acceptance_rate           2535
host_is_superhost               418
beds                            307
description                     109
host_has_profile_pic              2
host_identity_verified            2
host_thumbnail_url                2
host_total_listings_count         2
host_listings_count               2
host_picture_url                  2
host_since                        2
host_name                         2
dtype: int64
  • weryfikacja czy kolumny z 2 NaN to są tylko dwa wiersze:
    • tak, to te same
In [ ]:
s = df.query('host_name.isna()').iloc[0]
s[s.isna()]
Out[ ]:
host_name                      NaN
host_since                     NaN
host_location                  NaN
host_about                     NaN
host_response_time             NaN
host_response_rate             NaN
host_acceptance_rate           NaN
host_thumbnail_url             NaN
host_picture_url               NaN
host_neighbourhood             NaN
host_listings_count            NaN
host_total_listings_count      NaN
host_has_profile_pic           NaN
host_identity_verified         NaN
bedrooms                       NaN
review_scores_accuracy         NaN
review_scores_cleanliness      NaN
review_scores_checkin          NaN
review_scores_communication    NaN
review_scores_location         NaN
review_scores_value            NaN
Name: 2038, dtype: object
In [ ]:
s = df.query('host_name.isna()').iloc[1]
s[s.isna()]
Out[ ]:
neighborhood_overview          NaN
host_name                      NaN
host_since                     NaN
host_location                  NaN
host_about                     NaN
host_response_time             NaN
host_response_rate             NaN
host_acceptance_rate           NaN
host_thumbnail_url             NaN
host_picture_url               NaN
host_neighbourhood             NaN
host_listings_count            NaN
host_total_listings_count      NaN
host_has_profile_pic           NaN
host_identity_verified         NaN
neighbourhood                  NaN
bedrooms                       NaN
first_review                   NaN
last_review                    NaN
review_scores_rating           NaN
review_scores_accuracy         NaN
review_scores_cleanliness      NaN
review_scores_checkin          NaN
review_scores_communication    NaN
review_scores_location         NaN
review_scores_value            NaN
reviews_per_month              NaN
Name: 2064, dtype: object

Uzupełnienia brakujących beds i bedrooms po średnich wartościach dla accommodates¶

In [ ]:
merged_beds_df = pd.merge(
    pd.merge(
         df.groupby('accommodates').describe()['beds'],
         df.groupby('accommodates')['beds'].apply(lambda x: x.isna().sum()),
         on='accommodates').rename(columns={'beds':'NaN beds'}),
    pd.merge(
         df.groupby('accommodates').describe()['bedrooms'],
         df.groupby('accommodates')['bedrooms'].apply(lambda x: x.isna().sum()),
         on='accommodates').rename(columns={'bedrooms':'NaN bedrooms'}),
    on='accommodates')
merged_beds_df
Out[ ]:
count_x mean_x std_x min_x 25%_x 50%_x 75%_x max_x NaN beds count_y mean_y std_y min_y 25%_y 50%_y 75%_y max_y NaN bedrooms
accommodates
1 2493.0 1.199759 1.047147 1.0 1.0 1.0 1.00 26.0 116 484.0 1.826446 2.032562 1.0 1.0 1.0 1.00 9.0 2125
2 5689.0 1.198453 0.522178 1.0 1.0 1.0 1.00 16.0 137 2547.0 1.063997 0.276444 1.0 1.0 1.0 1.00 4.0 3279
3 1474.0 2.101085 0.664255 1.0 2.0 2.0 2.00 6.0 13 1137.0 1.611258 0.590548 1.0 1.0 2.0 2.00 6.0 350
4 3673.0 2.560305 0.876790 1.0 2.0 2.0 3.00 8.0 27 3437.0 1.875182 0.634994 1.0 1.0 2.0 2.00 10.0 263
5 1345.0 3.581413 0.966452 1.0 3.0 4.0 4.00 15.0 5 1328.0 2.578313 0.636645 1.0 2.0 3.0 3.00 5.0 22
6 1631.0 4.148375 1.162316 1.0 3.0 4.0 5.00 12.0 4 1603.0 2.744230 0.656825 1.0 2.0 3.0 3.00 6.0 32
7 319.0 5.175549 1.249112 1.0 4.0 5.0 6.00 10.0 1 315.0 3.396825 0.631944 2.0 3.0 3.0 4.00 5.0 5
8 471.0 5.636943 1.390635 1.0 5.0 6.0 6.00 13.0 2 464.0 3.482759 0.757771 1.0 3.0 4.0 4.00 7.0 9
9 78.0 6.346154 1.885191 1.0 5.0 7.0 7.75 10.0 1 73.0 4.000000 1.067187 1.0 3.0 4.0 5.00 6.0 6
10 165.0 7.096970 1.715156 4.0 6.0 7.0 8.00 13.0 1 162.0 4.160494 1.027048 1.0 4.0 4.0 5.00 7.0 4
11 12.0 7.750000 1.356801 6.0 7.0 8.0 8.00 11.0 0 12.0 4.083333 0.996205 2.0 4.0 4.0 4.25 6.0 0
12 41.0 9.756098 2.817627 5.0 8.0 9.0 11.00 18.0 0 40.0 4.975000 1.510434 1.0 4.0 5.0 6.00 9.0 1
13 9.0 8.888889 1.763834 5.0 8.0 9.0 10.00 11.0 0 9.0 5.333333 1.224745 4.0 5.0 5.0 6.00 8.0 0
14 20.0 10.600000 3.377947 5.0 8.5 10.0 14.00 18.0 0 19.0 5.421053 1.426565 1.0 5.0 6.0 6.00 7.0 1
15 23.0 12.869565 2.735363 8.0 11.0 13.0 15.00 18.0 0 23.0 6.347826 1.612697 4.0 6.0 6.0 6.00 10.0 0
16 33.0 15.030303 5.370550 1.0 12.0 15.0 18.00 30.0 0 32.0 8.093750 1.940392 4.0 6.0 8.0 9.00 12.0 1
In [ ]:
merged_beds_df = merged_beds_df[['mean_x', 'mean_y']]
merged_beds_df['mean_x'] = merged_beds_df['mean_x'].round(decimals=0)
merged_beds_df['mean_y'] = merged_beds_df['mean_y'].round(decimals=0)
merged_beds_df = merged_beds_df.rename(columns={'mean_x':'avg_beds', 'mean_y':'avg_bedrooms'})
merged_beds_df
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Out[ ]:
avg_beds avg_bedrooms
accommodates
1 1.0 2.0
2 1.0 1.0
3 2.0 2.0
4 3.0 2.0
5 4.0 3.0
6 4.0 3.0
7 5.0 3.0
8 6.0 3.0
9 6.0 4.0
10 7.0 4.0
11 8.0 4.0
12 10.0 5.0
13 9.0 5.0
14 11.0 5.0
15 13.0 6.0
16 15.0 8.0
In [ ]:
avg_beds_bedrooms = merged_beds_df.to_dict()
avg_beds_bedrooms
Out[ ]:
{'avg_beds': {1: 1.0,
  2: 1.0,
  3: 2.0,
  4: 3.0,
  5: 4.0,
  6: 4.0,
  7: 5.0,
  8: 6.0,
  9: 6.0,
  10: 7.0,
  11: 8.0,
  12: 10.0,
  13: 9.0,
  14: 11.0,
  15: 13.0,
  16: 15.0},
 'avg_bedrooms': {1: 2.0,
  2: 1.0,
  3: 2.0,
  4: 2.0,
  5: 3.0,
  6: 3.0,
  7: 3.0,
  8: 3.0,
  9: 4.0,
  10: 4.0,
  11: 4.0,
  12: 5.0,
  13: 5.0,
  14: 5.0,
  15: 6.0,
  16: 8.0}}
In [ ]:
print(df['beds'].isna().sum())
print(df['bedrooms'].isna().sum())
307
6098
In [ ]:
def beds_fix(current_beds, accomodates):
  if pd.isna(current_beds):
      return avg_beds_bedrooms['avg_beds'][accomodates]
  else:
    return current_beds

def bedroomss_fix(current_bedrooms, accomodates):
  if pd.isna(current_bedrooms):
      return avg_beds_bedrooms['avg_bedrooms'][accomodates]
  else:
    return current_bedrooms

beds_fix(None, 4)
Out[ ]:
3.0
In [ ]:
print(df['beds'].isna().sum(), df['bedrooms'].isna().sum())

df['beds'] = df.apply(lambda row: beds_fix(row['beds'], row['accommodates']), axis=1)
df['bedrooms'] = df.apply(lambda row: beds_fix(row['bedrooms'], row['accommodates']), axis=1)

print(df['beds'].isna().sum(), df['bedrooms'].isna().sum())
307 6098
0 0
In [ ]:
n = df.isna().sum().sort_values(ascending=False)
n[n > 0]

fig = px.bar(n[n>2])

fig.update_layout(
    width=1300, height=700, template='plotly_dark', bargap=0.05,
    title='Wykres brakujących wartości',
    xaxis_title="Kolumna",
    yaxis_title="Ilość NaN",
    legend=dict(title=''))

fig
In [ ]:
df.query('number_of_reviews==0').shape
Out[ ]:
(4371, 92)

Tak więc łóżka i sypialnie zostały rozwiązane. Pozostała jedynie kwestia recenzji. Ilość brakujących cech dotyczących recenzji jest bardzo zbliżona do siebie ~4450 i 4371. Są to wartości niedostępne ze względu na zupełny brak wystawionych recenzji.

Ponieważ korelacja ceny jest najwyższa z ilością wystawionych recenzji a nie faktyczną wartością recenzji, myślę że spokojnie można usunąć te kolumny, i zostawić jedynie number_of_reviews i number_of_reviews_ltm.

Wizualizacja¶

In [ ]:
from pandas.api.types import is_numeric_dtype
In [ ]:
numeric = []

for i in df.columns:
  if is_numeric_dtype(df[i]):
    numeric.append(i)
In [ ]:
numeric.remove('id',)
numeric.remove('scrape_id')
numeric.remove('host_id')
In [ ]:
numeric_df = df[numeric]
In [ ]:
# numeric.remove('price')
In [ ]:
ndf = df.dropna()
In [ ]:
ndf.shape
Out[ ]:
(3339, 92)
In [ ]:
numeric_df.shape
Out[ ]:
(17783, 63)
In [ ]:
corr_df = numeric_df.corr(method='spearman', min_periods=33)
In [ ]:
fig = px.imshow(corr_df.round(decimals=2) * 100, # w skali 1-100 żeby zajmowało mniej miejsca w kwadracie
                color_continuous_scale='RdBu_r',
                text_auto=True,
                labels=dict(color="Correlation"),
                x=numeric_df.columns,
                y=numeric_df.columns)

fig.update_traces(textfont_size=8)
fig.update_xaxes(tickfont=dict(size=9))
fig.update_yaxes(tickfont=dict(size=9))

fig.update_layout(height=1200, )
fig.show()
In [ ]:
corr_df.columns
Out[ ]:
Index(['host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified', 'latitude',
       'longitude', 'accommodates', 'bedrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
       'private_bathrooms', 'shared_bathrooms', 'total_bathrooms',
       'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator',
       'AC/heating', 'wifi', 'kitchen', 'washer', 'Exempt', 'HUTB',
       'Entire home/apt', 'Hotel room', 'Private room', 'Shared room',
       'center_distance'],
      dtype='object')
In [ ]:
# fig = px.bar(corr_df.round(decimals=2)[1:])
fig = px.bar(numeric_df.corr(method='spearman', min_periods=33)['price'].sort_values(ascending=False)[1:],
             color_discrete_sequence=[px.colors.qualitative.Plotly[3]])
fig.update_yaxes(range=[-0.75,0.75])
fig.update_xaxes(tickangle=50)
fig.update_layout(
    showlegend=False,
    title=dict(text="Korelacja cech wobec cen ofert", font=dict(size=22), automargin=True, yref='paper'),
    xaxis_title="Cecha",
    yaxis_title="Korelacja",
    template='plotly_dark',
    height=660,
    width=1920
)
Wnioski z korelacji wobec ceny:¶
  • Ilość osób / łóżek najbardziej wpływa na cenę
  • Licencje mają wysokie korelacje $\pm$ 0.52 (kolumna Exempt zostanie usunięta przed modelowaniem)
  • Rodzaj oferty jest bardzo istotny - Private_room i Entire home/apt mają wysokie korelacje: $\pm$ 0,57, pozostałe rodzaje już mają mniejsze, np. Shared_room ma zaledwie -0.07
  • Podział łazienek okazał się bardzo sensowny - private_bathrooms ma z nich najwyższą korelację 0.5, total_bathrooms 0.19, a shared_bathrooms -0.48
    • Oferty z shared_bathrooms są przeważająco mniej luksusowe - niższa cena, liczba dostępnych gości, brak udogodnień. Oferty podobne do hosteli
  • Dystans (center_distance) ma korelację ok. 0.14
  • Rozdzielone Amenities ma duży rozrzut koreleacji,
    • Największą ma crib która bardzo wysoko koreluje z accomodates - małe dzieci są uwzględniane jako normalny gość
    • wifi zdaje się być standardem - brak korelacji
    • kitchen jest jedynym udogodnieniem z ujemną korelacją (co prawda praktyncznie nieważną) - pewnie mniej 'hotelowe' oferty
  • Spośród kolumn ilości nocy - minimum_nights, maximum_minimum_nights etc. mają wysokie ujemne korelacje z ceną i oczywiście HUTB

    • Wydaje mi się że najsensowniej byłoby pozostawić tylko maximum_nights i minimum_nights - teoretycznie gdyby ktoś korzystał z modelu żeby określić cenę wynajmu dla danej oferty, częściej będzie miał te wartości niż maximum_maximum_nights itp. Na przykład ktoś chciałby wycenić swoje mieszkanie na wynajem. Te cechy były by przydatne dla kogoś, kto już wynajmował mieszkanie i chce zaktualizować cenę wynajmu. -host_response_time również ładnie się przedstawia z korelacją -0.29, dla przypomnienia, tak wyglądał ordinal encoding:
      response_ordinal = {
      'within an hour' : 1,
      'within a few hours' : 2,
      'within a day' : 3,
      'a few days or more' : 4
      }
  • calculated_host_listings_count_private_rooms - druga najwyższa ujemna korelacja, wynika z podobnej logiki co private vs shared_bathrooms - same pokoje są tańsze niż całe mieszkania, czyli calculated_host_listings_count_entire_homes

  • Nieoczekiwanie review_scores_rating jest zupełnie nieistotne: korelacja równa -0.03, podobnie review_score_rating (które dotyczy oceny względem kosztu noclegu).
  • Cechy dotyczące recenzji które mają korelację z ceną to number_of_reviews_ltm i number_of_reviews, które również wysoce korelują ze sobą (0.83).
    • Przypuszczam, że wysoka korelacja ilości recenzji z ostatniego roku (ltm - last 12 months) z ceną wynika z ogólnego wzrostu cen praktycznie wszystkiego.
    • Korelacja ogólnej ilości recenzji z ceną może też wynikać z możliwości podniesienia ceny przez gospodarza, który w momencie gdy zdobył popularność / pozytywne recenzje, podniósł ceny noclegu.

Korelacja dzielnic¶

In [ ]:
print(df.shape)
dum_neighs = pd.get_dummies(df["neighbourhood_group_cleansed"], drop_first=False)
df = pd.concat([df, dum_neighs], axis=1)
df.shape
(17783, 92)
Out[ ]:
(17783, 102)
In [ ]:
districts = df.columns[-10:].tolist()
districts.append('price')
In [ ]:
fig = px.bar(df[districts].corr(method='spearman', min_periods=33)['price'].sort_values(ascending=False)[1:],
             color_discrete_sequence=[px.colors.qualitative.Plotly[3]])
fig.update_yaxes(range=[-0.5,0.5])
fig.update_layout(
    showlegend=False,
    title=dict(text="Korelacja dzielnic do cen", font=dict(size=22), automargin=True, yref='paper'),
    xaxis_title="Dzielnice",
    yaxis_title="Korelacja",
    template='plotly_dark',
    width=800,
    height=400
)

Jak widać dzielnice nie mają wysokiej korelacji, z najwyższą dla Eixample równą 0.19. Ponieważ dzielnice są istotne dla innych cech, np. center_distance myślę że warto je uwzględnić jako dummies w modelu. Ewentualnie możnaby zostawić jedynie kolumnę Eixample dla uproszczenia modelu.

Współczynnik Cramera dla cech kategorycznych¶

In [ ]:
from scipy.stats import chi2_contingency

def cramers_V(var1,var2) :
  crosstab =np.array(pd.crosstab(var1,var2, rownames=None, colnames=None)) # Cross table building
  stat = chi2_contingency(crosstab)[0] # Keeping of the test statistic of the Chi2 test
  obs = np.sum(crosstab) # Number of observations
  mini = min(crosstab.shape)-1 # Take the minimum value between the columns and the rows of the cross table
  return np.sqrt(stat/(obs*mini))
In [ ]:
nominal_cols = [
    'host_is_superhost',
    'HUTB',
    'Entire home/apt', 'Private room', 'Hotel room', 'Shared room',
    'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator', 'AC/heating', 'wifi', 'kitchen', 'washer'
    ]
nominal_df = numeric_df[nominal_cols]
# nominal_df
In [ ]:
rows= []

for var1 in nominal_df:
  col = []
  for var2 in nominal_df :
    cramers =cramers_V(nominal_df[var1], nominal_df[var2])
    col.append(round(cramers,2))
  rows.append(col)

cramers_results = np.array(rows)
cramer_df = pd.DataFrame(cramers_results, columns = nominal_df.columns, index =nominal_df.columns)

# cramer_df
In [ ]:
fig = px.imshow(cramer_df, text_auto=True,)

fig.update_traces(textfont_size=10)

fig.update_layout(height=1100, width=1100,
                  title='Współczynnik V Cramera dla wybranych cech kategorycznych')
fig.show()

Nie powiedziałbym że ten wykres jest szczególnie przydatny

Mapy¶

In [ ]:
gdf.head(3)
Out[ ]:
index neighbourhood neighbourhood_group geometry
0 0 el Raval Ciutat Vella MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3...
1 1 el Barri Gòtic Ciutat Vella MULTIPOLYGON (((2.18287 41.38074, 2.18289 41.3...
2 2 la Dreta de l'Eixample Eixample MULTIPOLYGON (((2.17091 41.40182, 2.17332 41.3...
In [ ]:
mean_neighs = df.groupby('neighbourhood_cleansed')['price'].mean()
mean_neighs = mean_neighs.to_dict()
c_mean_neighs = df.groupby('neighbourhood_cleansed')['price'].describe()[['count', 'mean']].to_dict()
In [ ]:
def get_mean_neigh(neigh_name):
  if neigh_name in mean_neighs:
    return mean_neighs[neigh_name]
  else:
    return None


def c_get_mean_neigh(neigh_name):
  if neigh_name in c_mean_neighs['mean']:
    return c_mean_neighs['mean'][neigh_name]
  else:
    return None

def c_get_count_neigh(neigh_name):
  if neigh_name in c_mean_neighs['mean']:
    return c_mean_neighs['count'][neigh_name]
  else:
    return 0
In [ ]:
c_get_mean_neigh('el Turó de la Peira')
Out[ ]:
40.63157894736842
In [ ]:
gdf['mean_neigh'] = gdf['neighbourhood'].apply(lambda x: get_mean_neigh(x))
gdf['count_neigh'] = gdf['neighbourhood'].apply(lambda x: c_get_count_neigh(x))
gdf = gdf.rename(columns={'mean_neigh':'Średnia cena za nocleg w dzielnicy [$]', 'count_neigh':'Ilość ofert w dzielnicy'})
gdf.head(3)
Out[ ]:
index neighbourhood neighbourhood_group geometry Średnia cena za nocleg w dzielnicy [$] Ilość ofert w dzielnicy
0 0 el Raval Ciutat Vella MULTIPOLYGON (((2.17737 41.37532, 2.17852 41.3... 111.116424 1443.0
1 1 el Barri Gòtic Ciutat Vella MULTIPOLYGON (((2.18287 41.38074, 2.18289 41.3... 133.708436 1221.0
2 2 la Dreta de l'Eixample Eixample MULTIPOLYGON (((2.17091 41.40182, 2.17332 41.3... 206.169538 2206.0
In [ ]:
fig = px.choropleth_mapbox(gdf, geojson=geojson,
                           locations=gdf.index,
                           color='Ilość ofert w dzielnicy',
                           hover_name='neighbourhood_group',
                           hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False, 'Ilość ofert w dzielnicy':True},
                           featureidkey='properties.index',
                           mapbox_style='carto-positron',
                           center={'lat': 41.3933, 'lon': 2.145},
                           zoom=10.85,
                           opacity=0.75)
fig.update_traces(
    name='Średnia cena za nocleg', selector=(dict(type='choroplethmapbox'))
)
fig.update_layout(
    height=800, width=800,
    # template='plotly_dark',
    bargap=0.05,
    title='Mapa dzielnic (mniejszych) Barcelony')

fig.show()
In [ ]:
fig = px.choropleth_mapbox(gdf, geojson=geojson,
                           locations=gdf.index,
                           color='Średnia cena za nocleg w dzielnicy [$]',
                           hover_name='neighbourhood_group',
                           hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False, 'Ilość ofert w dzielnicy':True},
                           featureidkey='properties.index',
                           mapbox_style='carto-positron',
                           center={'lat': 41.3933, 'lon': 2.145},
                           zoom=10.85,
                           opacity=0.75)
fig.update_traces(
    name='Średnia cena za nocleg', selector=(dict(type='choroplethmapbox'))
)
fig.update_layout(
    height=800, width=1000,
    # template='plotly_dark',
    bargap=0.05,
    title='Mapa dzielnic (mniejszych) Barcelony')

fig.show()
In [ ]:
fig = px.choropleth_mapbox(gdf, geojson=geojson,
                           locations=gdf.index,
                           color='neighbourhood_group',
                           hover_name='neighbourhood_group',
                           hover_data={'neighbourhood': False, 'neighbourhood_group': False, 'index': False},
                           featureidkey='properties.index',
                           mapbox_style='carto-positron',
                           center={'lat': 41.3933, 'lon': 2.145},
                           zoom=10.85,
                           opacity=0.75)

fig.update_layout(
    height=800, width=800,
    # template='plotly_dark',
    bargap=0.05,
    title='Mapa dzielnic Barcelony')

fig.show()

Wizualizacja innych cech¶

In [ ]:
# !pip install sweetviz -qq
# import sweetviz as sv
# import warnings
# warnings.filterwarnings('ignore')

# my_report = sv.analyze(df, pairwise_analysis="off")
# my_report.show_notebook()
In [ ]:
fig = px.histogram(df, x='price', color_discrete_sequence=[px.colors.qualitative.Plotly[2]])
fig.update_xaxes(title_text='Cena za nocleg [$]')
fig.update_yaxes(title_text='Częstość')
fig.update_layout(title='Histogram cen za nocleg',
                  width=1100, height=400,)
In [ ]:
grouped_amenities = df.groupby('license_cleansed')[[i for i in amenities]].sum().transpose()
grouped_amenities['sum'] = grouped_amenities['Exempt'] + grouped_amenities['HUTB']
grouped_amenities = grouped_amenities.sort_values('sum', ascending=False)
fig = px.bar(grouped_amenities[['Exempt','HUTB']])

fig.update_layout(height=500, width=1100,
                  title='Ilość udogodnień',
                  xaxis_title="Udogodnienie",
                  yaxis_title="Ilość",
                  legend_title = 'Licencja')
In [ ]:
fig = px.histogram(df, x='room_type', y='price', color='license_cleansed', barmode='group', histfunc='avg')

fig.update_layout(height=500, width=1100,
                  title='Cena za nocleg zależnie od rodzaju zakwaterowania i rodzaju licencji',
                  xaxis_title="Rodzaj zakwaterowania",
                  yaxis_title="Średnia cena za nocleg [$]",
                  legend_title = 'Licencja')

fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])
In [ ]:
fig = px.box(df, x='room_type', y='price', color='license_cleansed', )
fig.update_traces(jitter=0.25, marker=dict(size=3))
fig.update_layout(height=500, width=1100,
                  title='Wykres pudełkowy ceny za nocleg w zależności od licencji i rodzaju zakwaterowania',
                  xaxis_title="Rodzaj zakwaterowania",
                  yaxis_title="Cena za nocleg [$]",
                  legend_title = 'Licencja')
fig
In [ ]:
fig = px.box(df.query('price <=500'), x='room_type', y='price', color='license_cleansed', )
fig.update_traces(jitter=0.25, marker=dict(size=3))
fig.update_layout(height=500, width=1100,
                  title='Wykres pudełkowy ceny (do $500/noc) za nocleg w zależności od licencji i rodzaju zakwaterowania',
                  xaxis_title="Rodzaj zakwaterowania",
                  yaxis_title="Cena za nocleg [$]",
                  legend_title = 'Licencja')
fig
In [ ]:
grouped_distr = pd.merge(
    df.groupby('neighbourhood_group_cleansed')['price'].describe(),
    df.groupby('neighbourhood_group_cleansed')['center_distance'].describe(),
    on='neighbourhood_group_cleansed',
).sort_values(by='mean_x', ascending=False)

#grouped_distr
In [ ]:
grouped_distr
Out[ ]:
count_x mean_x std_x min_x 25%_x 50%_x 75%_x max_x count_y mean_y std_y min_y 25%_y 50%_y 75%_y max_y
neighbourhood_group_cleansed
Eixample 6393.0 169.692320 150.755844 20.0 66.00 147.0 221.00 2000.0 6393.0 1492.811202 640.788493 23.451382 1039.855381 1453.586658 1991.372788 3112.951897
Sarrià-Sant Gervasi 880.0 149.497727 172.308934 20.0 59.00 122.5 183.00 1500.0 880.0 3694.034081 1252.484645 1921.810642 2876.642777 3371.226469 4139.976383 9868.081543
Sant Martí 1619.0 145.017912 159.309617 20.0 54.00 100.0 190.50 2000.0 1619.0 3751.391000 1077.494645 1600.817873 2874.949281 3568.740549 4404.874294 6868.509183
Gràcia 1556.0 142.798843 154.641592 20.0 52.00 120.5 180.00 1500.0 1556.0 2373.654962 654.081453 1566.727177 1902.758078 2194.347859 2636.622502 5682.112705
Les Corts 351.0 133.846154 106.328893 20.0 62.00 113.0 166.00 1000.0 351.0 4386.130585 926.871102 2984.518282 3615.895271 4310.096066 4992.869648 7229.699586
Ciutat Vella 4188.0 124.504059 133.453096 20.0 51.00 80.0 150.25 1600.0 4188.0 1095.661683 517.282337 132.004175 761.616842 1032.218828 1291.133506 2979.583787
Sants-Montjuïc 1840.0 117.895109 125.607801 20.0 50.00 80.5 157.25 1790.0 1840.0 2983.038600 1235.827991 1334.991640 1738.282970 3030.268516 4056.355592 5411.672584
Horta-Guinardó 519.0 93.630058 134.332088 20.0 40.00 60.0 104.00 1800.0 519.0 3690.854151 932.979459 2251.691936 2919.209270 3550.261859 4219.209881 6883.726557
Sant Andreu 248.0 75.838710 72.944867 20.0 34.75 55.0 85.25 700.0 248.0 4897.537865 1047.854364 3622.339033 4037.163865 4516.946084 5767.867382 7900.081680
Nou Barris 189.0 73.206349 122.095423 20.0 30.00 50.0 75.00 1533.0 189.0 5550.214267 852.126785 4399.795130 4836.235167 5266.269047 6241.164125 8104.114307
In [ ]:
pcr_df = df.groupby(['neighbourhood_group_cleansed', 'license_cleansed'])['price', 'center_distance', 'review_scores_rating'].mean().reset_index()
Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
In [ ]:
count_ddf = df.groupby(['neighbourhood_group_cleansed', 'license_cleansed'])['price', 'center_distance', 'review_scores_rating'].count().reset_index()
Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
In [ ]:
dist_count_order = df.groupby('neighbourhood_group_cleansed')['id'].count().sort_values(ascending=False).index.tolist()
# dist_count_order
In [ ]:
fig = px.bar(count_ddf, x='neighbourhood_group_cleansed', y='price', color='license_cleansed')
fig.update_layout(title='Ilość ofert w danej dzielnicy',
                  width=1100, height=450,
                  xaxis_title='',
                  yaxis_title= "Ilość ofert",
                  legend_title = 'Licencja'
                  )
fig.update_xaxes(categoryorder='array', categoryarray=dist_count_order)
fig
In [ ]:
fig = px.bar(pcr_df, x='neighbourhood_group_cleansed', y='review_scores_rating', color='license_cleansed', barmode='group')
fig.update_layout(title='Średnia ocena zależnie od licencji dla danej dzielnicy',
                  width=1100, height=450,
                  xaxis_title='',
                  yaxis_title= "Średnia ocena",
                  legend_title = 'Licencja'
                  )

fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])

fig.show()
In [ ]:
fig = px.box(df.sort_values(by='license_cleansed'), x='neighbourhood_group_cleansed', y='review_scores_rating', color='license_cleansed')
fig.update_layout(title='Oceny zależnie od licencji i dzielnicy',
                  width=1100, height=450,
                  xaxis_title='',
                  yaxis_title= "Ocena",
                  legend_title = 'Licencja'
                  )

fig.update_xaxes(categoryorder='array', categoryarray=['Hotel room', 'Entire home/apt', 'Shared room', 'Private room'])

fig.show()
In [ ]:
fig = px.histogram(df, x='center_distance', color_discrete_sequence=[px.colors.qualitative.Plotly[4]])
fig.update_xaxes(title_text='Dystans do centrum [m]')
fig.update_yaxes(title_text='Częstość')
fig.update_layout(title='Histogram odległości do centrum Barcelony (Plaza de Cataluña)',
                  width=1100, height=400,)
In [ ]:
#@title Ceny i dystans w dzielnicach
lines = [
    'Zrobienie one-hot encoding dla dzielnic',
    'będzie przydatne. Jest duża zależność',
    'między ceną a dzielnicą i nie jest ona',
    'jednoznaczna z odległością do centrum.',
    '',
    'np. Ciutat Vella ma najniższe ceny, ale',
    'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines)  ### kod do zawijania adnotacji na plotly

trace1 = go.Bar(x=grouped_distr.index, y=grouped_distr['mean_x'], ### Złożony wykres słupkowy
                name='Średnia cena za nocleg [$]',
                marker=dict(color='#00CC96'))

trace2 = go.Bar(x=grouped_distr.index, y=grouped_distr['mean_y'],
                name='Średni dystans do centrum [m]',
                marker=dict(color='#FFA15A'))

fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(title='Średnia cena za nocleg i średni dystans do centrum dla ofert z danej dzielnicy Barcelony',
                  width=1100, height=700,
                  margin=dict(r=290))
fig.update_yaxes(title_text='Średnia cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)

fig.add_annotation(dict(font=dict(size=12,),
                        width=350,
                        x=1.014,
                        y=0.90,
                        showarrow=False,
                        text=text,
                        align='left',
                        xanchor='left',
                        xref="paper",
                        yref="paper"))
fig.show()
In [ ]:
#@title Ceny i dystans w dzielnicach
lines = [
    'Zrobienie one-hot encoding dla dzielnic',
    'będzie przydatne. Jest duża zależność',
    'między ceną a dzielnicą i nie jest ona',
    'jednoznaczna z odległością do centrum.',
    '',
    'np. Ciutat Vella ma najniższe ceny, ale',
    'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines)  ### kod do zawijania adnotacji na plotly

trace1 = go.Box(x=df['neighbourhood_group_cleansed'], y=df.query('price <=500')['price'], ### Złożony wykres słupkowy
                name='Średnia cena za nocleg [$]',
                marker=dict(color='#00CC96'))

trace2 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['center_distance'],
                name='Średni dystans do centrum [m]',
                marker=dict(color='#FFA15A'))

fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)
fig.update_traces(jitter=0.25, marker=dict(size=2.5))

fig.update_layout(title='Ceny za nocleg (do $500 / noc) i dystans do centrum dla ofert z danej dzielnicy Barcelony',
                  width=1100, height=700,
                  margin=dict(r=290))
fig.update_yaxes(title_text='Cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)
fig.update_xaxes(categoryorder='array', categoryarray=grouped_distr.index.tolist())

fig.add_annotation(dict(font=dict(size=12,),
                        width=350,
                        x=1.014,
                        y=0.90,
                        showarrow=False,
                        text=text,
                        align='left',
                        xanchor='left',
                        xref="paper",
                        yref="paper"))
fig.show()
In [ ]:
#@title Ceny i dystans w dzielnicach
lines = [
    'Zrobienie one-hot encoding dla dzielnic',
    'będzie przydatne. Jest duża zależność',
    'między ceną a dzielnicą i nie jest ona',
    'jednoznaczna z odległością do centrum.',
    '',
    'np. Ciutat Vella ma najniższe ceny, ale',
    'jej oferty są najbliżej centrum.'
]
text = '<br>'.join(lines)  ### kod do zawijania adnotacji na plotly

trace1 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['price'], ### Złożony wykres słupkowy
                name='Średnia cena za nocleg [$]',
                marker=dict(color='#00CC96'))

trace2 = go.Box(x=df['neighbourhood_group_cleansed'], y=df['center_distance'],
                name='Średni dystans do centrum [m]',
                marker=dict(color='#FFA15A'))

fig = make_subplots(rows=2, cols=1)
fig.add_trace(trace1, row=1, col=1)
fig.add_trace(trace2, row=2, col=1)

fig.update_layout(title='Ceny za nocleg i dystans do centrum dla ofert z danej dzielnicy Barcelony',
                  width=1100, height=700,
                  margin=dict(r=290))
fig.update_yaxes(title_text='Średnia cena za nocleg [$]', row=1, col=1)
fig.update_yaxes(title_text='Średni dystans do centrum [m]', row=2, col=1)
fig.update_xaxes(categoryorder='array', categoryarray=grouped_distr.index.tolist())

fig.add_annotation(dict(font=dict(size=12,),
                        width=350,
                        x=1.014,
                        y=0.90,
                        showarrow=False,
                        text=text,
                        align='left',
                        xanchor='left',
                        xref="paper",
                        yref="paper"))
fig.show()

OGROMNE WYKRESY¶


Ze względu na fakt, że kaleido z jakiegoś powodu nie działa, i nie jestem w stanie wyświetlić wykresów jako .svg lub .png, wykresy powodują dość duże spowolnienie działania komputera. Poniżej do zminienia czy wykresy mają być wyświetlane czy nie.

In [ ]:
show_large_plots = True
In [ ]:
review_cols = [c for c in df.columns if 'review' in c]
In [ ]:
#@title Recenzje - histogramy

if show_large_plots:
  fig = make_subplots(rows=4, cols=4)

  fig.add_trace(go.Histogram(x=df['number_of_reviews']           , name=review_cols[0]  ), row=1, col=1, )
  fig.add_trace(go.Histogram(x=df['number_of_reviews_ltm']       , name=review_cols[1]  ), row=1, col=2, )
  fig.add_trace(go.Histogram(x=df['number_of_reviews_l30d']      , name=review_cols[2]  ), row=1, col=3, )
  fig.add_trace(go.Histogram(x=df['first_review']                , name=review_cols[3]  ), row=1, col=4, )
  fig.add_trace(go.Histogram(x=df['last_review']                 , name=review_cols[4]  ), row=2, col=1, )
  fig.add_trace(go.Histogram(x=df['review_scores_rating']        , name=review_cols[5]  ), row=2, col=2, )
  fig.add_trace(go.Histogram(x=df['review_scores_accuracy']      , name=review_cols[6]  ), row=2, col=3, )
  fig.add_trace(go.Histogram(x=df['review_scores_cleanliness']   , name=review_cols[7]  ), row=2, col=4, )
  fig.add_trace(go.Histogram(x=df['review_scores_checkin']       , name=review_cols[8]  ), row=3, col=1, )
  fig.add_trace(go.Histogram(x=df['review_scores_communication'] , name=review_cols[9]  ), row=3, col=2, )
  fig.add_trace(go.Histogram(x=df['review_scores_location']      , name=review_cols[10] ), row=3, col=3, )
  fig.add_trace(go.Histogram(x=df['review_scores_value']         , name=review_cols[11] ), row=3, col=4, )
  fig.add_trace(go.Histogram(x=df['reviews_per_month']           , name=review_cols[12] ), row=4, col=1, )

  fig.update_xaxes(title_text="Number of Reviews"                , row=1, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Number of Reviews LTM"            , row=1, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Number of Reviews L30D"           , row=1, col=3, title_standoff=0)
  fig.update_xaxes(title_text="First Review"                     , row=1, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Last Review"                      , row=2, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Rating"             , row=2, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Accuracy"           , row=2, col=3, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Cleanliness"        , row=2, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Checkin"            , row=3, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Communication"      , row=3, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Location"           , row=3, col=3, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Value"              , row=3, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Reviews per Month"                , row=4, col=1, title_standoff=0)

  fig.update_layout(title='Histogramy wartości dotyczących recenzji',
                    margin=dict(l=50, r=50, t=70, b=70),
                    height=900)

  fig.show()
In [ ]:
#@title Recenzje a cena


if show_large_plots:
  fig = make_subplots(
    rows=4, cols=4,
    specs=[
        [{}, {}, {}, {}],
        [{}, {}, {}, {}],
        [{}, {}, {}, {}],
        [{}, {"rowspan":1, "colspan": 3}, None, None,]
           ],
    print_grid=False)

  fig.add_trace(go.Scatter(x=df['number_of_reviews']             , mode='markers', y=df['price'] , name=review_cols[0]  ) , row=1, col=1, )
  fig.add_trace(go.Scatter(x=df['number_of_reviews_ltm']         , mode='markers', y=df['price'] , name=review_cols[1]  ) , row=1, col=2, )
  fig.add_trace(go.Scatter(x=df['number_of_reviews_l30d']        , mode='markers', y=df['price'] , name=review_cols[2]  ) , row=1, col=3, )
  fig.add_trace(go.Scatter(x=df['first_review']                  , mode='markers', y=df['price'] , name=review_cols[3]  ) , row=1, col=4, )
  fig.add_trace(go.Scatter(x=df['last_review']                   , mode='markers', y=df['price'] , name=review_cols[4]  ) , row=2, col=1, )
  fig.add_trace(go.Scatter(x=df['review_scores_rating']          , mode='markers', y=df['price'] , name=review_cols[5]  ) , row=2, col=2, )
  fig.add_trace(go.Scatter(x=df['review_scores_accuracy']        , mode='markers', y=df['price'] , name=review_cols[6]  ) , row=2, col=3, )
  fig.add_trace(go.Scatter(x=df['review_scores_cleanliness']     , mode='markers', y=df['price'] , name=review_cols[7]  ) , row=2, col=4, )
  fig.add_trace(go.Scatter(x=df['review_scores_checkin']         , mode='markers', y=df['price'] , name=review_cols[8]  ) , row=3, col=1, )
  fig.add_trace(go.Scatter(x=df['review_scores_communication']   , mode='markers', y=df['price'] , name=review_cols[9]  ) , row=3, col=2, )
  fig.add_trace(go.Scatter(x=df['review_scores_location']        , mode='markers', y=df['price'] , name=review_cols[10] ) , row=3, col=3, )
  fig.add_trace(go.Scatter(x=df['review_scores_value']           , mode='markers', y=df['price'] , name=review_cols[11] ) , row=3, col=4, )
  fig.add_trace(go.Scatter(x=df['reviews_per_month']             , mode='markers', y=df['price'] , name=review_cols[12] ,marker=dict(color='#636EFA')) , row=4, col=1, )
  fig.add_trace(go.Histogram(x=df['price']                       , marker=dict(color='#00CC96')  , name='price'         ) , row=4, col=2, )


  fig.update_xaxes(title_text="Number of Reviews"                , row=1, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Number of Reviews LTM"            , row=1, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Number of Reviews L30D"           , row=1, col=3, title_standoff=0)
  fig.update_xaxes(title_text="First Review"                     , row=1, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Last Review"                      , row=2, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Rating"             , row=2, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Accuracy"           , row=2, col=3, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Cleanliness"        , row=2, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Checkin"            , row=3, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Communication"      , row=3, col=2, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Location"           , row=3, col=3, title_standoff=0)
  fig.update_xaxes(title_text="Review Scores Value"              , row=3, col=4, title_standoff=0)
  fig.update_xaxes(title_text="Reviews per Month"                , row=4, col=1, title_standoff=0)
  fig.update_xaxes(title_text="Cena za nocleg"                , row=4, col=2, title_standoff=0)
  fig.update_yaxes(title_text="Częstość"                , row=4, col=2, title_standoff=0)



  fig.update_layout(title='Histogramy wartości dotyczących recenzji',
                    margin=dict(l=50, r=50, t=70, b=70),
                    height=900)

  fig.show()

Model¶

To już spoza zakresu zadania, ale będzie do portfolio.

Do modelowania zostanie wykorzystany XGBoost ze względu na jego niezawodność oraz SHAP do analizy.

In [ ]:
model_cols = corr_df.columns
model_cols
Out[ ]:
Index(['host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified', 'latitude',
       'longitude', 'accommodates', 'bedrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
       'private_bathrooms', 'shared_bathrooms', 'total_bathrooms',
       'refrigerator', 'tv', 'iron', 'crib', 'parking', 'elevator',
       'AC/heating', 'wifi', 'kitchen', 'washer', 'Exempt', 'HUTB',
       'Entire home/apt', 'Hotel room', 'Private room', 'Shared room',
       'center_distance'],
      dtype='object')
In [ ]:
df[model_cols].isna().sum()[:10].sort_values(ascending=False)
Out[ ]:
host_response_time           2812
host_response_rate           2812
host_acceptance_rate         2535
host_is_superhost             418
host_listings_count             2
host_total_listings_count       2
host_has_profile_pic            2
host_identity_verified          2
latitude                        0
longitude                       0
dtype: int64
In [ ]:
df['host_is_superhost'].fillna(0, inplace=True)
In [ ]:
districts = df['neighbourhood_group_cleansed'].unique().tolist()
model_cols = model_cols.tolist()
In [ ]:
model_cols.remove('host_acceptance_rate')
model_cols.remove('latitude')
model_cols.remove('longitude')
model_cols.remove('review_scores_rating')
model_cols.remove('review_scores_accuracy')
model_cols.remove('review_scores_cleanliness')
model_cols.remove('review_scores_checkin')
model_cols.remove('review_scores_communication')
model_cols.remove('review_scores_location')
model_cols.remove('review_scores_value')
model_cols.remove('reviews_per_month')
model_cols.remove('Exempt')
In [ ]:

In [ ]:
for i in districts:
  model_cols.append(i)

print('Model df shape:', df[model_cols].shape)
print('Model df shape:', df[model_cols].dropna().shape)
Model df shape: (17783, 61)
Model df shape: (14971, 61)
In [ ]:
for i in model_cols:
  print(
      i,
      df[i].isna().sum()
      )
host_response_time 2812
host_response_rate 2812
host_is_superhost 0
host_listings_count 2
host_total_listings_count 2
host_has_profile_pic 2
host_identity_verified 2
accommodates 0
bedrooms 0
beds 0
price 0
minimum_nights 0
maximum_nights 0
minimum_minimum_nights 0
maximum_minimum_nights 0
minimum_maximum_nights 0
maximum_maximum_nights 0
minimum_nights_avg_ntm 0
maximum_nights_avg_ntm 0
has_availability 0
availability_30 0
availability_60 0
availability_90 0
availability_365 0
number_of_reviews 0
number_of_reviews_ltm 0
number_of_reviews_l30d 0
instant_bookable 0
calculated_host_listings_count 0
calculated_host_listings_count_entire_homes 0
calculated_host_listings_count_private_rooms 0
calculated_host_listings_count_shared_rooms 0
private_bathrooms 0
shared_bathrooms 0
total_bathrooms 0
refrigerator 0
tv 0
iron 0
crib 0
parking 0
elevator 0
AC/heating 0
wifi 0
kitchen 0
washer 0
HUTB 0
Entire home/apt 0
Hotel room 0
Private room 0
Shared room 0
center_distance 0
Eixample 0
Sant Martí 0
Gràcia 0
Sants-Montjuïc 0
Ciutat Vella 0
Les Corts 0
Sarrià-Sant Gervasi 0
Sant Andreu 0
Horta-Guinardó 0
Nou Barris 0
In [ ]:
!pip install shap -qq
In [ ]:
import statsmodels.formula.api as smf

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error

from sklearn.linear_model import lasso_path

import xgboost
import shap
In [ ]:
df = df[model_cols].dropna()
df = df.query('price <= 500')
In [ ]:
X = df.drop('price', axis=1)
In [ ]:
X = preprocessing.scale(X)
y = df['price']
In [ ]:
alphas_lasso, coefs_lasso, _ = lasso_path(X, y)
In [ ]:
traces = []
for i in range(X.shape[1]):
    trace = go.Scatter(
        x = alphas_lasso,
        y = coefs_lasso[i],
        mode = 'lines',
        name = df.drop('price', axis=1).columns[i]
    )
    traces.append(trace)

layout = go.Layout(
    title = 'Lasso paths - Plotly',
    xaxis = dict(title = 'Log($\\lambda$)'),
    yaxis = dict(title = 'coefficients'),
    width=1400, height=1400, template='plotly_dark',
    legend = dict(orientation = 'h')
)

fig = go.Figure(data = traces, layout = layout)

fig.update_xaxes(type = 'log')

fig.show()

XGBoost¶

In [ ]:
X = df.drop('price', axis=1)
y = df['price']
In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [ ]:
import xgboost as xgb
In [ ]:
xgb_reg = xgb.XGBRegressor()
In [ ]:
params = {
    'objective': ['reg:squarederror', 'reg:linear'],
    'max_depth': [7, 8],
    'learning_rate': [0.1, 0.01],
    'n_estimators': [150,300,450]
}

xgb_reg.set_params(**params)
Out[ ]:
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=[0.1, 0.01],
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=[7, 8], max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=[150, 300, 450], n_jobs=None,
             num_parallel_tree=None,
             objective=['reg:squarederror', 'reg:linear'], ...)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=[0.1, 0.01],
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=[7, 8], max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=[150, 300, 450], n_jobs=None,
             num_parallel_tree=None,
             objective=['reg:squarederror', 'reg:linear'], ...)
In [ ]:
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(xgb_reg, params, cv=5, n_jobs=-1)
grid_search.fit(X_train, y_train)
Out[ ]:
GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=[0.1, 0...
                                    max_depth=[7, 8], max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None,
                                    n_estimators=[150, 300, 450], n_jobs=None,
                                    num_parallel_tree=None,
                                    objective=['reg:squarederror',
                                               'reg:linear'], ...),
             n_jobs=-1,
             param_grid={'learning_rate': [0.1, 0.01], 'max_depth': [7, 8],
                         'n_estimators': [150, 300, 450],
                         'objective': ['reg:squarederror', 'reg:linear']})
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GridSearchCV(cv=5,
             estimator=XGBRegressor(base_score=None, booster=None,
                                    callbacks=None, colsample_bylevel=None,
                                    colsample_bynode=None,
                                    colsample_bytree=None, device=None,
                                    early_stopping_rounds=None,
                                    enable_categorical=False, eval_metric=None,
                                    feature_types=None, gamma=None,
                                    grow_policy=None, importance_type=None,
                                    interaction_constraints=None,
                                    learning_rate=[0.1, 0...
                                    max_depth=[7, 8], max_leaves=None,
                                    min_child_weight=None, missing=nan,
                                    monotone_constraints=None,
                                    multi_strategy=None,
                                    n_estimators=[150, 300, 450], n_jobs=None,
                                    num_parallel_tree=None,
                                    objective=['reg:squarederror',
                                               'reg:linear'], ...),
             n_jobs=-1,
             param_grid={'learning_rate': [0.1, 0.01], 'max_depth': [7, 8],
                         'n_estimators': [150, 300, 450],
                         'objective': ['reg:squarederror', 'reg:linear']})
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=[0.1, 0.01],
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=[7, 8], max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=[150, 300, 450], n_jobs=None,
             num_parallel_tree=None,
             objective=['reg:squarederror', 'reg:linear'], ...)
XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=[0.1, 0.01],
             max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=[7, 8], max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=[150, 300, 450], n_jobs=None,
             num_parallel_tree=None,
             objective=['reg:squarederror', 'reg:linear'], ...)
In [ ]:
model = grid_search.best_estimator_
print('Best hyperparameters:', grid_search.best_params_)
Best hyperparameters: {'learning_rate': 0.1, 'max_depth': 7, 'n_estimators': 450, 'objective': 'reg:squarederror'}
In [ ]:
predicted = model.predict(X_test)
In [ ]:
graph_pred = predicted.copy()
In [ ]:
fig = px.histogram(y_test - graph_pred, nbins=100)
fig.update_layout(width=780, height=625, template='plotly_white', bargap=0.05,)

fig.update_layout(
    width=780, height=625,
    title='',
    showlegend=False,
    xaxis_title="Residuals, różnice ceny",
    yaxis_title="Częstość")

fig.add_vline(x=0)
In [ ]:
import plotly.graph_objects as go
trace2 = go.Scatter(
    x=y_test,
    y=graph_pred,
    mode='markers',
    name='Scatter Plot')

trace1 = go.Scatter(
    x=y_test,
    y=y_test,
    mode='lines',
    name='Line Plot')

fig = go.Figure()
fig.add_trace(trace2)

fig.add_trace(trace1)

fig.update_layout(
    width=540, height=523, margin=dict(l=30, r=30, t=30, b=30),
    showlegend=False,
    xaxis_title="Cena sprawdzania",
    yaxis_title="Cena przewidziana",
    # xaxis_range=[0, 2500000], yaxis_range=[0, 2500000]
    )

fig.show()

Wyniki i SHAP¶

In [ ]:
from sklearn import metrics

MAE = metrics.mean_absolute_error(y_test, predicted)
RMSE = np.sqrt(metrics.mean_squared_error(y_test, predicted))
Score = metrics.explained_variance_score(y_test, predicted)

print(f'MAE: {MAE}')
print(f'RMSE: {RMSE}')
print(f'Score: {Score}')
MAE: 25.398717930093763
RMSE: 40.3322784967719
Score: 0.8119994928762382
In [ ]:
shap.initjs()
In [ ]:
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
[13:19:04] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
In [ ]:
# z dokumentacji
'''From official documentation: The summary plot combines feature importance with feature effects.
Each point on the summary plot is a Shapley value for a feature and an instance.
The position on the y-axis is determined by the feature and on the x-axis by the Shapley value.
The color represents the value of the feature from low to high.
Overlapping points are jittered in y-axis direction, so we get a sense of the distribution
of the Shapley values per feature. The features are ordered according to their importance.'''

shap.summary_plot(shap_values, features=X, feature_names=X.columns)
In [ ]:
cols = ['HUTB', 'accommodates', 'private_bathrooms', 'center_distance',
        'minimum_nights_avg_ntm', 'maximum_minimum_nights', 'crib', 'calculated_host_listings_count_entire_homes',
        'host_is_superhost', 'minimum_minimum_nights', 'host_listings_count', 'Entire home/apt', 'availability_30',
        'minimum_maximum_nights', 'calculated_host_listings_count_private_rooms', 'tv', 'bedrooms', 'host_total_listings_count' ,
        'calculated_host_listings_count', 'availability_60']
In [ ]:

In [ ]:
# # tu pojawiał się błąd z dla force_plot, wynika z konfliktów między wersjami
# # X = (pd.DataFrame.from_dict([wlochy_test, ]))
# shap_values = explainer.shap_values(X_test.sample()[cols])
# shap.force_plot(explainer.expected_value, shap_values, X, matplotlib=True)
In [ ]:
import pickle

pickle.dump(model, open('Barc XGB.pkl', "wb"))

# xgb_model_loaded = pickle.load(open(file_name, "rb"))
In [ ]: